scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-11-29 14:10:24 +00:00

Author	SHA1	Message	Date
Tejun Heo	70088fd7da	Merge pull request #63 from sched-ext/userland_updates Userland scheduler updates	2024-01-03 10:15:18 +09:00
David Vernet	e8978ebe23	scx_userland: Introduce ops.update_idle() callback We can sometimes hit scenarios in the scx_userland scheduler where there is work to be done in user space, but we incorrectly fail to run the user space scheduler. In order to avoid this, we can use global variables that are set from both BPF and user space. The BPF-side variable reflects when one or more tasks have been enqueued, and the user space-side variable reflects when user space has received tasks but has not yet dispatched them. In the ops.update_idle() callback, we can check these variables and send a resched IPI to a core to ensure that the user-space scheduler is always scheduled when there's work to be done. Signed-off-by: David Vernet <void@manifault.com>	2024-01-02 16:29:19 -06:00
David Vernet	620abac46f	Merge pull request #62 from arighi/c-include-portability scheds: c: improve build portability	2024-01-02 11:51:43 -06:00
Andrea Righi	bcbce040b6	scheds: c: improve build portability Improve build portability by including asm-generic/errno.h, instead of linux/errno.h. The difference between these two headers can be summarized as follows: - asm-generic/errno.h contains generic error code definitions that are intended to be common across different architectures, - linux/errno.h includes architecture-specific error codes and provides additional (or overrides) error code definitions based on the specific architecture where the code is compiled. Considering the architecture-independent nature of scx, the advantages of being able to use architecture-specific error codes are marginal or negligible (and we should probably discourage using them). Moving towards asm-generic/errno.h, however, allows the removal of cross-compilation dependencies (such as the gcc-multilib package in Debian/Ubuntu) and improves the code portability across various architectures and distributions. This also allows removing a symlink hack from the github workflow. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-02 17:39:46 +01:00
David Vernet	d05c7cf6c3	Merge pull request #51 from arighi/virtme-ng-github-workflow test the schedulers in the github workflow using virtme-ng	2024-01-02 08:43:54 -06:00
Tejun Heo	54363be254	Merge pull request #61 from arighi/scx-rustland-exiting-tasks scx_rustland: notify user-space scheduler about exiting tasks	2024-01-02 21:59:36 +09:00
Andrea Righi	a09482f0ef	scx_rustland: notify user-space scheduler about exiting tasks Instead of implementing a garbage collector to periodically free up exiting tasks' resources, implement a proper synchronous mechanism to notify the user-space scheduler about the exiting tasks from the BPF component, using the .disable() callback. When the user-space scheduler receives a queued task with a negative CPU number, it can then release all the resources associated with that task (which currently includes only the entry in the TaskInfoMap for now). This allows to get rid of the TaskInfoMap periodic garbage collector routine, save a lot of syscalls in procfs (used to check if the pids were still alive), and improve the overall scheduler performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-02 12:57:27 +01:00
Tejun Heo	6c437ae2b6	Merge pull request #60 from arighi/scx-rustland-prevent-starvation scx_rustland: prevent starvation and improve responsiveness	2024-01-02 13:46:20 +09:00
Andrea Righi	280796c4bd	scx_rustland: small code refactoring No functional change, make the user-space scheduler code a bit more readable and more Rust idiomatic. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 19:47:30 +01:00
Andrea Righi	2900b208fe	scx_rustland: evaluate the proper vruntime delta The formula used to evaluate the weighted time delta is not correct: it does not consider the weight as a percentage. Fix this by using the proper formula. Moreover, also take the task weight into account when evaluating the maximum time delta to account in vruntime and make sure that we never charge a task more than slice_ns. This helps to prevent starvation of low priority tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 19:47:30 +01:00
Andrea Righi	90e92ace2d	scx_rustland: prevent starvation handling short-lived tasks properly Prevent newly created short-lived tasks from starving the other tasks sitting in the user-space scheduler. This can be done by setting an initial vruntime of (min_vruntime + 1) to newly scheduled tasks, instead of min_vruntime: this ensures a progressing global vruntime during each scheduler run, providing a priority boost to newer tasks (that is still beneficial for potential short-lived tasks) while also preventing excessive starvation of the other tasks sitting in the user-space scheduler, waiting to be dispatched. Without this change it is really easy to create a stall condition simply by forking a bunch of short-lived tasks in a busy loop. With this change applied, the scheduler can properly handle the consistent flow of newly created short-lived tasks, without introducing any stall. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 16:58:28 +01:00
Andrea Righi	676bd88ada	bpf_rustland: do not dispatch the scheduler to the global DSQ Never dispatch the user-space scheduler to the global DSQ, while all the other tasks are dispatched to the local per-CPU DSQ. Since tasks are consumed from the local DSQ first and then from the global DSQ, we may end up starving the scheduler if we dispatch only this one on the global DSQ. In fact it is really easy to trigger a stall with a workload that triggers many context switches in the system, for example (on a 8 cores system): $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s ... 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s 09:28:11 [INFO] Unregister RustLand scheduler To prevent this from happening also dispatch the user-space scheduler on the local DSQ, using the current CPU where .dispatch() is called, if possible, or the previously used CPU otherwise. Apply the same logic when the scheduler is congested: dispatch on the previously used CPU using the local DSQ. In this way all tasks will always get the same "dispatch priority" and we can prevent the scheduler starvation issue. Note that with this change in place dispatch_global() is never used and we can get rid of it. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	0fc46b2be2	scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available() With commit `49f2e7c` ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have enabled SCX_OPS_ENQ_LAST, which seems to save some unnecessary user-space scheduler activations when the system is mostly idle. We are also checking for SCX_ENQ_LAST in the enqueue flags, which apparently is not needed: we can achieve the same behavior by dropping this check. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	840260141d	scx_rustland: never account more than slice_ns to vruntime In any case make sure that we never account more than the maximum slice_ns to a task's vruntime. This helps to prevent starving a task for too long in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	61c77b7d87	scx_rustland: clean up old entries in the task map The user-space scheduler maintains an internal hash map of task information (indexed by pid). Tasks are only added to this hash map and never removed. After running the scheduler for a while we may experience a performance degradation, because the hash map keeps growing. Therefore implement a garbage collector mechanism to remove the old entries from the task map (periodically removing pids that don't exist anymore). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	27739065bc	scx_rustland: rename variable id -> pos for better clarity Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Tejun Heo	70803d5e14	Merge pull request #59 from arighi/lowlatency-improvements scx_rustland: lowlatency improvements	2024-01-01 06:14:50 +09:00
Andrea Righi	1cdcb8af60	scx_rustland: show the CPU where the scheduler is running In the scheduler statistics reported periodically to stdout, instead of showing "pid=0" for the CPU where the scheduler is running (like an idle CPU), show "[self]". This helps to identify exactly where the user-space scheduler is running (when and where it migrates, etc.). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 17:03:30 +01:00
Andrea Righi	a7677fdf28	scx_rustland: bypass user-space scheduler for short-lived kthreads Bypass the user-space scheduler for kthreads that still have more than half of their runtime budget. As they are likely to release the CPU soon, granting them a substantial priority boost can enhance the overall system performance. In the event that one of these kthreads turns into a CPU hog, it will deplete its runtime budget and therefore it will be scheduled like any other normal task through the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:40:05 +01:00
Andrea Righi	405a11308e	scx_rustland: always use dispatch_on_cpu() when possible Use dispatch_on_cpu() when possible, so that all tasks dispatched by the user-space scheduler get the same priority, instead of having some of them dispatched to the global DSQ and others dispatched to the per-CPU DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:08:31 +01:00
Andrea Righi	49f2e7ce06	scx_rustland: enable SCX_OPS_ENQ_LAST Make sure the scheduler is not activated if we are dealing with the last task running. This allows us to consistently reduce scx_rustland CPU usage in systems that are mostly idle (and avoid unnecessary power consumption). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:06:45 +01:00
Tejun Heo	804180a74a	Merge pull request #58 from arighi/scx-rustland-improve-idle-cpu-assignment scx_rustland: prevent dispatching multiple tasks on the same idle cpu	2023-12-31 18:00:47 +09:00
Andrea Righi	0522219bea	scx_rustland: prevent dispatching multiple tasks on the same idle cpu When a task is dispatched we always try to pick the previously used CPU (if idle) to minimize the migration overhead. Alternatively, if such CPU is not available, we pick any other idle CPU in the system. However, we don't update the list of idle CPUs as we dispatch tasks, therefore we may end up sending multiple tasks to the same idle CPU (if their previously used CPU is the same) and we may even skip some idle CPUs completely. Change this logic to make sure that we never dispatch multiple tasks to the same idle CPU, by updating the list of idle CPUs as we send tasks to the BPF dispatcher. This also avoids dispatching tasks with a closely matched vruntime to the same CPU, thereby negating the advantages of the vruntime ordering. With this change in place, we ensure that tasks with a similar vruntime are dispatched to different CPUs, leading to significant improvements in latency performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 09:37:39 +01:00
Tejun Heo	641f9b76e9	Merge pull request #57 from arighi/scx-rustland-improve-cpu-selection scx_rustland: improve scheduler cpu selection	2023-12-30 21:56:48 +09:00
Andrea Righi	38145f8dc9	scx_rustland: check CPU selection validity When the scheduler decides to assign a different CPU to the task, always make sure the assignment is valid according to the task cpumask. If it's not valid, simply dispatch the task to the global DSQ. This prevents the scheduler from exiting with errors like this: 09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718] In the future we may want to move this check directly into the user-space scheduler, but for now let's keep this check in the BPF dispatcher as a quick fix. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-30 10:40:46 +01:00
Andrea Righi	1a2c9f5fd4	scx_rustland: improve scheduler's idle CPU selection The current CPU selection logic in the scheduler presents some inefficiencies. When a task is drained from the BPF queue, the scheduler immediately checks whether the CPU previously assigned to the task is still idle, assigning it if it is. Otherwise, it iterates through available CPUs, always starting from CPU #0, and selects the first idle one without updating its state. This approach is consistently applied to the entire batch of tasks drained from the BPF queue, resulting in all of them being assigned to the same idle CPU (also with a higher likelihood of allocation to lower CPU ids rather than higher ones). While dispatching a batch of tasks to the same idle CPU is not necessarily problematic, a fairer distribution among the list of idle CPUs would be preferable. Therefore change the CPU selection logic to distribute tasks equally among the idle CPUs, still maintaining the preference for the previously used one. Additionally, apply the CPU selection logic just before tasks are dispatched, rather than assigning a CPU when tasks are drained from the BPF queue. This adjustment is important, because tasks may linger in the scheduler's internal structures for a bit and the idle state of the CPUs in the system may change during that period. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-30 10:34:08 +01:00
Tejun Heo	474a14970e	Merge pull request #56 from arighi/scx-rustland-reduce-scheduler-overhead scx_rustland: reduce scheduler overhead	2023-12-30 08:02:09 +09:00
Andrea Righi	e90bc923f9	scx_rustland: introduce nr_waiting concept We want to activate the user-space scheduler only when there are pending tasks that require scheduling actions. To do so we keep track of the queued tasks via nr_queued, that is incremented in .enqueue() when a task is sent to the user-space scheduler and decremented in .dispatch() when a task is dispatched. However, we may trigger an imbalance if the same pid is sent to the scheduler multiple times (because the scheduler stores all the tasks by their unique pid). When this happens nr_queued is never decremented back to 0, leading the user-space scheduler to constantly spin, even if there's no activity to do. To prevent this from happening split nr_queued into nr_queued and nr_scheduled. The former will be updated by the BPF component every time that a task is sent to the scheduler and it's up to the user-space scheduler to reset the counter when the queue is fully drained. The latter is maintained by the user-space scheduler and represents the number of tasks that are still processed by the scheduler and are waiting to be dispatched. The sum of nr_queued + nr_scheduled will be called nr_waiting and we can rely on this metric to determine if the user-space scheduler has some pending work to do or not. This change makes scx_rustland more reliable and it strongly reduces the CPU usage of the user-space scheduler by eliminating a lot of unnecessary activations. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 21:15:04 +01:00
Andrea Righi	d67dfe50f9	scx_rustland: treat the CPU running the user-space scheduler as idle Considering the CPU where the user-space scheduler is running as busy doesn't provide any benefit, as the scheduler consistently dispatches a number of tasks equal to the number of idle CPUs and then yields (therefore its own CPU should be considered idle). This also makes it possible to reduce the overall user-space scheduler CPU utilization, especially when the system is mostly idle, without introducing any measurable performance regression. Measuring the average CPU utilization of a (mostly) idle system over a time period of 60 sec: - without this patch: 5.41% avg cpu util - with this patch: 2.26% avg cpu util Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 21:14:58 +01:00
Andrea Righi	05f5c69747	ci: use virtme-ng to test the schedulers Use virtme-ng to run the schedulers after they're built; virtme-ng allows picking an arbitrary sched-ext enabled kernel and running it while virtualizing the entire user-space root filesystem, so we can basically execute the recompiled schedulers inside such a kernel. This should allow catching potential run-time issues in advance (both in the kernel and the schedulers). The sched-ext kernel is taken from the Ubuntu ppa (ppa:arighi/sched-ext) at the moment, since it is the easiest / fastest way to get a precompiled sched-ext kernel to run inside the Ubuntu 22.04 testing environment. The schedulers are tested using the new meson target "test_sched", the specific actions are defined in meson-scripts/test_sched. By default each test has a timeout of 30 sec, after virtme-ng completes the boot (that should be enough to initialize the scheduler and run the scheduler for some seconds), while the total lifetime of the virtme-ng guest is set to 60 sec, after this time the guest will be killed (this allows catching potential kernel crashes / hangs). If a single scheduler fails the test, the entire "test_sched" action will be interrupted and the overall test result will be considered a failure. At the moment scx_layered is excluded from the tests, because it requires a special configuration (we should probably pre-generate a default config in the workflow actions and change the scheduler to use the default config if it's executed without any argument). Moreover, scx_flatcg is also temporarily excluded from the tests, because of these known issues: - https://github.com/sched-ext/scx/issues/49 - https://github.com/sched-ext/sched_ext/pull/101 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 15:54:10 +01:00
Andrea Righi	dbc8e23980	scx_userland: flush stdout when printing stats Periodically flush stdout to help following the scheduler progress during testing. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 15:53:12 +01:00
Andrea Righi	614a1ff901	scx_flatcg: flush stdout when printing stats Periodically flush stdout to help following the scheduler progress during testing. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 15:53:12 +01:00
Tejun Heo	3206464405	Merge pull request #55 from arighi/scx-rustland-doc scx_rustland: add documentation to scheds/rust/README.md	2023-12-29 17:35:09 +09:00
Andrea Righi	cc17780c24	scx_rustland: add documentation to scheds/rust/README.md Add documentation for scx_rustland to the README.md files of the Rust schedulers. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 09:13:54 +01:00
Tejun Heo	d2a173fc51	Merge pull request #53 from sched-ext/htejun Suppress the deprecation warning from bindgen and bump versions	2023-12-29 07:07:06 +09:00
Tejun Heo	98773131df	Bump versions to publish scx_utils fedora compat change	2023-12-29 06:58:45 +09:00
Tejun Heo	c47a4b6716	scx_utils: Explain what's going on with bindgen version and suppress deprecation warning This is a followup to https://github.com/sched-ext/scx/pull/50. See the comment in BpfBuilder::bindgen_bpf_intf() for details.	2023-12-29 06:56:07 +09:00
Tejun Heo	1d868dbf89	Merge pull request #50 from jordalgo/downgrade-bindgen Downgrade bindgen to 0.68	2023-12-29 06:28:20 +09:00
Tejun Heo	e230e86272	Merge pull request #52 from arighi/scx-rustland-update-idle scx_rustland: introduce update_idle callback	2023-12-29 06:10:40 +09:00
Andrea Righi	6df4d7e0c6	scx_rustland: introduce an update_idle() callback Move the logic to activate the userspace scheduler to an update_idle() callback, which is called when the CPU is about to go idle. This disables the built-in idle tracking mechanism, so it allows to rely completely on the internal CPU ownership logic (via get_cpu_owner() and set_cpu_owner()) and it also allows to share the idle state with the user-space scheduler via the BPF_MAP_TYPE_ARRAY cpu_map. Moreover, when the user-space scheduler is activated, kick the idle cpu to trigger immediate dispatch and avoid bubbles in the scheduling pipeline. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-28 14:41:08 +01:00
Andrea Righi	1baae38e7f	Revert "scx_rustland: always dispatch kthreads on the local CPU" This reverts commit `9237e1d` ("scx_rustland: always dispatch kthreads on the local CPU"). Do not always prioritize all kthreads, we may have unbound workqueue workers that can consume a lot of CPU cycles (e.g., encryption workers), so we definitely want to apply the scheduling for those. Therefore, restore the old behavior to prioritize only per-CPU kthreads. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-28 14:40:03 +01:00
Tejun Heo	990cd058fe	Merge pull request #48 from arighi/scx-rustland-userspace-interlocking scx_rustland: clarify and improve BPF / userspace interlocking	2023-12-28 08:26:55 +09:00
Jordan Rome	c8a721b033	Downgrade bindgen to 0.68 This is so we can package scx_utils into fedora without having to upgrade rust-bindgen (https://bodhi.fedoraproject.org/updates/FEDORA-2023-18e7f124e1). To make this happen we need to stop using the `CargoCallbacks::new` constructor which was added in 0.69. Old way seems legit according to the docs: https://rust-lang.github.io/rust-bindgen/non-system-libraries.html	2023-12-27 12:19:28 -08:00
Andrea Righi	9237e1d835	scx_rustland: always dispatch kthreads on the local CPU Adding extra overhead to any kthread can potentially slow down the entire system, so make sure this never happens by dispatching all kthreads directly on the same local CPU (not just the per-CPU kthreads), bypassing the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-27 14:15:46 +01:00
Andrea Righi	f0ece7af6b	scx_rustland: wake-up user-space scheduler when a CPU is released Trigger the user-space scheduler only upon a task's CPU release event (avoiding its activation during each enqueue event) and only if there are tasks waiting to be processed by the user-space scheduler. This should save unnecessary calls to the user-space scheduler, reducing the overall overhead of the scheduler. Moreover, rename nr_enqueues to nr_queued and store the amount of tasks currently queued to the user-space scheduler (that are waiting to be dispatched). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-27 14:15:46 +01:00
Andrea Righi	7d01be9568	scx_rustland: provide get/set_cpu_owner() Provide the following primitives to get and set CPU ownership in the BPF part. This improves code readability and these primitives can be used by the BPF part as a baseline to implement a better CPU idle tracking in the future. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-27 14:15:39 +01:00
Andrea Righi	cd7e1c6248	scx_rustland: clarify BPF / user-space interlocking BPF doesn't have a full memory model yet, and while strict atomicity might not be necessary in this context, it is advisable to enhance clarity in the interlocking model. To achieve this, provide the following primitives to operate on usersched_needed: static void set_usersched_needed(void) static bool test_and_clear_usersched_needed(void) Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-26 14:28:24 +01:00
Tejun Heo	8443d8ac16	Merge pull request #47 from arighi/scx-rustland-cpu scx_rustland improvements	2023-12-24 06:29:15 +09:00
Andrea Righi	e038a530ae	scx_rustland: dispatch tasks in batch Dispatch tasks in a batch equal to the amount of idle CPUs in the system. This allows to reduce the pressure on the dispatcher queues, improving the effectiveness of the scheduler (by having more tasks sitting in the scheduler task pool) and mitigating potential priority inversion issues. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-23 10:44:03 +01:00
Andrea Righi	4d98862674	scx_rustland: expose CPU information to the user-space scheduler Provide an interface for the BPF dispatcher and user-space scheduler to share CPU information. This information can empower the user-space scheduler to make more informed decisions and enable the implementation of a broader range of scheduling policies. With this change the BPF dispatcher provides a CPU map (one entry per CPU) that stores the pid that is running on each CPU (0 if the CPU is idle). The CPU map is updated by the BPF dispatcher in the .running() and .stopping() callbacks. The dispatcher then sends to the user-space scheduler a suggestion of the candidate CPU for each task that needs to run (that is always the previously used CPU), along with all the task's information. The user-space scheduler can decide to confirm the selected CPU or to choose a different one, using all the shared CPU information. Lastly, the selected CPU is communicated back to the dispatcher along with all the task's information and the BPF dispatcher takes care of executing the task on the selected CPU, eventually triggering a migration. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-23 10:38:56 +01:00
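
The idle-CPU selection described in commit 0522219bea (prefer the previously used CPU if it is idle, otherwise pick another idle CPU, and never hand the same idle CPU to two tasks in one batch) can be sketched roughly as below. This is only an illustration in Python, not the scx_rustland code: the function name `assign_cpus`, the `(pid, prev_cpu)` tuple layout, and the choice of the lowest-numbered idle CPU as the fallback are all assumptions made for the example.

```python
def assign_cpus(tasks, idle_cpus):
    """Assign a CPU to each (pid, prev_cpu) task without reusing an idle CPU.

    Hypothetical helper for illustration only. Returns a list of (pid, cpu)
    pairs, where cpu is None once no idle CPU is left in this batch.
    """
    idle = set(idle_cpus)  # working copy that shrinks as tasks are assigned
    assignments = []
    for pid, prev_cpu in tasks:
        if prev_cpu in idle:
            # Prefer the previously used CPU to minimize migration overhead.
            cpu = prev_cpu
        elif idle:
            # Otherwise take any idle CPU (lowest id here, an arbitrary choice).
            cpu = min(idle)
        else:
            # No idle CPU left: leave the placement undecided.
            cpu = None
        if cpu is not None:
            # Update the idle set so the next task cannot get the same CPU.
            idle.discard(cpu)
        assignments.append((pid, cpu))
    return assignments
```

For example, with two tasks that both last ran on CPU 2 and idle CPUs 2 and 3, the first task keeps CPU 2 and the second is moved to CPU 3 instead of piling onto CPU 2.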


179 Commits