scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-11-29 06:00:23 +00:00

Author	SHA1	Message	Date
Andrea Righi	05d997c539	scx_rustland: more robust CPU selection logic in the dispatcher Instead of just trying the target CPU and the previously used CPU, we could cycle among all the available CPUs (if both those CPUs cannot be used), before using the global DSQ. This allows to not de-prioritize too much tasks that can't be scheduled on the CPU selected by the scheduler (or their previously used CPU), and we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Andrea Righi	18a990ae82	scx_rustland: assign min_vruntime before time slice evaluation Assign min_vruntime to the task before the weighted time slice is evaluated, then add the time slice. In this way we still ensure that the task's vruntime is in the range (min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify the effect of the evaluated time slice if the starting vruntime of the task is too small. Also change update_enqueued() to return the evaluated weighted time slice (that can be used in the future). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Andrea Righi	92109c95a9	scx_rustland: small TaskTree.push() refactoring Change TaskTree.push() to accept directly a Task object, rather than each individual attribute. Moreover, Task attributes don't need to be public, since both TaskTree and Task are only used locally. This makes the code more elegant and more readable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Tejun Heo	dfcc52c866	Merge pull request #71 from jordalgo/bump-rust bump scx_rusty and scx_layered	2024-01-05 07:03:49 +09:00
Jordan Rome	661ea57c5c	bump scx_rusty and scx_layered These were supposed to be bumped in this commit: `fed1dae9da`	2024-01-04 13:57:29 -08:00
Andrea Righi	96f3eb42be	Merge pull request #68 from sched-ext/scx-rustland-refactoring scx_rustland: refactoring	2024-01-04 20:42:30 +01:00
Andrea Righi	7813992896	scx_rustland: introduce nr_failed_dispatches Introduce a new counter to report the amount of failed dispatches: if the scheduler designates a target CPU for a task, and both the chosen CPU and the previously utilized one are unavailable when the task is dispatched, the task will be sent to the global DSQ, and the counter will be incremented. Also mark all the methods to access these statistics counters as optional. In the future we may also provide a "verbose" option and show these statistics only when the scheduler runs in verbose mode. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 17:36:06 +01:00
David Vernet	c9494e00ed	Merge pull request #67 from jordalgo/rust-readmes Add README files for each rust scheduler	2024-01-04 09:49:33 -06:00
Andrea Righi	796a7ebc0e	scx_rustland: provide an abstraction layer for the BPF component Move the code responsible for interfacing with the BPF component into its own module and provide high-level abstractions for the user-space scheduler, hiding all the internal BPF implementation details. This makes the user-space scheduler code much more readable and it allows potential developers/contributors that want to focus at the pure scheduling details to modify the scheduler in a generic way, without having to worry about the internal BPF details. In the future we may even decide to provide the BPF abstraction as a separate crate, that could be used as a baseline to implement user-space schedulers in Rust. API overview ============ The main BPF interface is provided by BpfScheduler(). When this object is initialized it will take care of registering and initializing the BPF component. Then the scheduler can use the BpfScheduler() instance to receive tasks (in the form of QueuedTask object) and dispatch tasks (in the form of DispatchedTask objects), using respectively the methods dequeue_task() and dispatch_task(). The CPU ownership map can be accessed using the method get_cpu_pid(), this also allows to keep track of the idle and busy CPUs, with the corrsponding PIDs associated to them. BPF counters and statistics can be accessed using the methods nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be updated to notify the BPF component if the user-space scheduler has some pending work to do or not. Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can be used respectively to read the exit code and exit message from the BPF component, when the scheduler is unregistered. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 16:49:09 +01:00
Jordan Rome	5bacefcdbe	Add README files for each rust scheduler This because each scheduler has it's own Rust Crate and it's better if they had a README associated with each one. https://crates.io/crates/scx_layered	2024-01-04 07:35:44 -08:00
Andrea Righi	7c11837a61	scx_rustland: make dispatcher more robust We always try to use the current CPU (from the .dispatch() callback) to run the user-space scheduler itself and if the current CPU is not usable (according to the cpumask) we just re-use the previouly used CPU. However, if the previously used CPU is also not usable, we may trigger the following error: sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201]) Potentially this can also happen with any task, so improve the dispatch logic as following: - dispatch on the target CPU, if usable - otherwise dispatch on the previously used CPU, if usable - otherwise dispatch on the global DSQ Moreover, rename dispatch_on_cpu() -> dispatch_task() for better clarity. This should be enough to handle all the possible decisions made by the user-space scheduler, making the dispatcher more robust. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 10:21:40 +01:00
Andrea Righi	69c1dfc03c	scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check In the dispatch callback we can dispatch tasks to any CPU, according to the scheduler decisions, so there's no reason to check for the available dispatch slots in the current CPU only, to determine if we need to stop dispatching tasks. Since the scheduler is aware of the idle state of the CPUs (via the CPU ownership map) it has all the information to automatically regulate the flow of dispatched tasks and not overflow the dispatch slots, therefore it is safe to remove this check. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 09:41:54 +01:00
Andrea Righi	6b1e7d927d	scx_rustland: update comments and documentation in the BPF part No functional change, only a little polishing, including updates to comments and documentation to align with the latest changes in the code. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 09:40:49 +01:00
Tejun Heo	c6b9173bf4	Merge pull request #66 from sched-ext/scx-userland-reliability Improve scx_rustland reliability	2024-01-04 10:46:35 +09:00
Andrea Righi	bb1c32d395	scx_rustland: avoid bypassing the scheduler with pending activities While bypassing the user-space scheduler can provide some benefits at reducing the scheduling overhead, doing so underneath the scheduler while it is actively taking decisions may disrupt its work and have a negative effect on the overall system performance. For this reason, activate the logic to bypass the user-space scheduler only when there is no pending work it. This change makes the scheduler much more reliable, for example on a 8-cores system it is really easy to trigger short lockups or even trigger the sched-ext watchdog that kicks out the scheduler, running the following stress test: $ stress-ng -c 128 With this change applied the system remains reasonably responsive and the scheduler is never disabled by the sched-ext watchdog. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 22:54:14 +01:00
Andrea Righi	5d15d34777	scx_rustland: charge additional time slice to new tasks Instead of accounting (max_slice_ns / 2) to the vruntime of all the new tasks, add that to thier regular weighted time delta, as an additional penalty. This allows to distinguish new CPU intensive tasks vs new less CPU intensive tasks, and prioritize the latter over the former. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 22:54:10 +01:00
Andrea Righi	8820af8d36	scx_rustland: enable user-space scheduler to preempt other tasks Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help to mitigate starvation in presence of many cpu hogs (way more than the amount of available CPUs) running in the system, by giving the scheduler more chances to drain the amount of tasks that may be starving in a waiting state. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 22:54:00 +01:00
Tejun Heo	020ae33fe2	Merge pull request #65 from jordalgo/arch-map-update Add new archs for bpf_builder	2024-01-04 04:02:33 +09:00
Jordan Rome	6caf6c5c99	Add new archs for bpf_builder This is to fix fedora build failures for these archs: s390x and ppc64le Error: ``` ---- bpf_builder::tests::test_bpf_builder_new stdout ---- thread 'bpf_builder::tests::test_bpf_builder_new' panicked at src/bpf_builder.rs:592:9: Failed to create BpfBuilder (Err(CPU arch "s390x" not found in ARCH_MAP)) ``` https://koji.fedoraproject.org/koji/taskinfo?taskID=111114326	2024-01-03 10:50:33 -08:00
David Vernet	9f1a3973d8	Merge pull request #64 from arighi/improve-interactive-workloads scx_rustland: improve interactive workloads	2024-01-03 12:10:26 -06:00
Andrea Righi	5d9182d9c3	scx_rustland: prioritize interactive workloads The current implementation of the user-space scheduler is strongly prioritizing newly created tasks by setting their initial vruntime to (min_vruntime + 1); this prioritization places them ahead of other tasks waiting to run. While this approach is efficient for processing short-lived tasks, it makes the scheduler vulnerable to fork-bomb attacks and significantly penalizes interactive workloads (e.g., "foreground" applications), in particular in the presence of background applications that are spawning multiple tasks, such as parallel builds. Instead of prioritizing newly created tasks, do the opposite and account (max_slice_ns / 2) to their initial vruntime, to make sure they are not scheduled before the other tasks that are already waiting for the CPU in the current scheduler run. This allows to mitigate potential fork-bomb attacks and it strongly improves the responsiveness of interactive applications (such as UI, audio/video streams, gaming, etc.). With this change applied, under certain conditions, scx_rustland can even outperform the default Linux scheduler. For example, with a parallel kernel build (make -j32) running in the background, I can play Terraria with a constant rate of ~30-40 fps, while the default Linux scheduler can handle only ~20-30 fps under the same conditions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 18:28:54 +01:00
Andrea Righi	50b5f6e8c6	scx_rustland: do not update exiting tasks statistics Avoid updating task information for tasks that are exiting, as they won't be used by the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 09:10:20 +01:00
Andrea Righi	b7a9d3775a	scx_rustland: schedule non-cpu intensive kthreads normally With commit `a7677fd` ("scx_rustland: bypass user-space scheduler for short-lived kthreads") we were try to mitigate a problem that was actually introduced by using the wrong formula to evaluate weighted vruntime, see commit `2900b20` ("scx_rustland: evaluate the proper vruntime delta"). Reverting that (pseudo-)optimization doesn't seem to introduce any performance/latency regression and it makes the code more elegant, therefore drop it. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-03 07:46:01 +01:00
Tejun Heo	70088fd7da	Merge pull request #63 from sched-ext/userland_updates Userland scheduler updates	2024-01-03 10:15:18 +09:00
David Vernet	e8978ebe23	scx_userland: Introduce ops.update_idle() callback We can sometimes hit scenarios in the scx_userland scheduler where there is work to be done in user space, but we incorrectly fail to run the user space scheduler. In order to avoid this, we can use global variables that are set from both BPF and user space. The BPF-side variable reflects when one or more tasks have been enqueued, and the user space-side variable reflects when user space has received tasks but has not yet dispatched them. In the ops.update_idle() callback, we can check these variables and send a resched IPI to a core to ensure that the user-space scheduler is always scheduled when there's work to be done. Signed-off-by: David Vernet <void@manifault.com>	2024-01-02 16:29:19 -06:00
David Vernet	620abac46f	Merge pull request #62 from arighi/c-include-portability scheds: c: improve build portability	2024-01-02 11:51:43 -06:00
Andrea Righi	bcbce040b6	scheds: c: improve build portability Improve build portability by including asm-generic/errno.h, instead of linux/errno.h. The difference between these two headers can be summarized as following: - asm-generic/errno.h contains generic error code definitions that are intended to be common across different architectures, - linux/errno.h includes architecture-specific error codes and provides additional (or overrides) error code definitions based on the specific architecture where the code is compiled. Considering the architecture-independent nature of scx, the advantages of being able to use architecture-specific error codes are marginal or negligible (and we should probably discourage using them). Moving towards asm-generic/errno.h, however, allows the removal of cross-compilation dependencies (such as the gcc-multilib package in Debian/Ubuntu) and improves the code portability across various architectures and distributions. This also allows to remove a symlink hack from the github workflow. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-02 17:39:46 +01:00
David Vernet	d05c7cf6c3	Merge pull request #51 from arighi/virtme-ng-github-workflow test the schedulers in the github workflow using virtme-ng	2024-01-02 08:43:54 -06:00
Tejun Heo	54363be254	Merge pull request #61 from arighi/scx-rustland-exiting-tasks scx_rustland: notify user-space scheduler about exiting tasks	2024-01-02 21:59:36 +09:00
Andrea Righi	a09482f0ef	scx_rustland: notify user-space scheduler about exiting tasks Instead of implementing a garbage collector to periodically free up exiting tasks' resources, implement a proper synchronous mechanism to notify the user-space scheduler about the exiting tasks from the BPF component, using the .disable() callback. When the user-space scheduler receives a queued task with a negative CPU number, it can then release all the resources associated with that task (which currently includes only the entry in the TaskInfoMap for now). This allows to get rid of the TaskInfoMap periodic garbage collector routine, save a lot of syscalls in procfs (used to check if the pids were still alive), and improve the overall scheduler performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-02 12:57:27 +01:00
Tejun Heo	6c437ae2b6	Merge pull request #60 from arighi/scx-rustland-prevent-starvation scx_rustland: prevent starvation and improve responsiveness	2024-01-02 13:46:20 +09:00
Andrea Righi	280796c4bd	scx_rustland: small code refactoring No functional change, make the user-space scheduler code a bit more readable and more Rust idiomatic. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 19:47:30 +01:00
Andrea Righi	2900b208fe	scx_rustland: evaluate the proper vruntime delta The forumla used to evaluate the weighted time delta is not correct, it's not considering the weight as a percentage. Fix this by using the proper formula. Moreover, take into account also the task weight when evaluating the maximum time delta to account in vruntime and make sure that we never charge a task more than slice_ns. This helps to prevent starvation of low priority tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 19:47:30 +01:00
Andrea Righi	90e92ace2d	scx_rustland: prevent starvation handling short-lived tasks properly Prevent newly created short-lived tasks from starving the other tasks sitting in the user-space scheduler. This can be done setting an initial vruntime of (min_vruntime + 1) to newly scheduled tasks, instead of min_vruntime: this ensures a progressing global vruntime durig each scheduler run, providing a priority boost to newer tasks (that is still beneficial for potential short-lived tasks) while also preventing excessive starvation of the other tasks sitting in the user-space scheduler, waiting to be dispatched. Without this change it is really easy to create a stall condition simply by forking a bunch of short-lived tasks in a busy loop, with this change applied the scheduler can handle properly the consistent flow of newly created short-lived tasks, without introducing any stall. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 16:58:28 +01:00
Andrea Righi	676bd88ada	bpf_rustland: do not dispatch the scheduler to the global DSQ Never dispatch the user-space scheduler to the global DSQ, while all the other tasks are dispatched to the local per-CPU DSQ. Since tasks are consumed from the local DSQ first and then from the global DSQ, we may end up starving the scheduler if we dispatch only this one on the global DSQ. In fact it is really easy to trigger a stall with a workload that triggers many context switches in the system, for example (on a 8 cores system): $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s ... 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s 09:28:11 [INFO] Unregister RustLand scheduler To prevent this from happening also dispatch the user-space scheduler on the local DSQ, using the current CPU where .dispatch() is called, if possible, or the previously used CPU otherwise. Apply the same logic when the scheduler is congested: dispatch on the previously used CPU using the local DSQ. In this way all tasks will always get the same "dispatch priority" and we can prevent the scheduler starvation issue. Note that with this change in place dispatch_global() is never used and we can get rid of it. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	0fc46b2be2	scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available() With commit `49f2e7c` ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have enabled SCX_OPS_ENQ_LAST that seems to save some unnecessary user-space scheduler activations when the system is mostly idle. We are also checking for the SCX_ENQ_LAST in the enqueue flags, that apparently it is not needed and we can achieve the same behavior dropping this check. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	840260141d	scx_rustland: never account more than slice_ns to vruntime In any case make sure that we never account more than the maximum slice_ns to a task's vruntime. This helps to prevent starving a task for too long in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	61c77b7d87	scx_rustland: clean up old entries in the task map The user-space scheduler maintains an internal hash map of tasks information (indexed by their pid). Tasks are only added to this hash map and never removed. After running the scheduler for a while we may experience a performance degration, because the hash map keeps growing. Therefore implement a mechanism of garbage collector to remove the old entries from the task map (periodically removing pids that don't exist anymore). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Andrea Righi	27739065bc	scx_rustland: rename variable id -> pos for better clarity Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-01 14:17:23 +01:00
Tejun Heo	70803d5e14	Merge pull request #59 from arighi/lowlatency-improvements scx_rustland: lowlatency improvements	2024-01-01 06:14:50 +09:00
Andrea Righi	1cdcb8af60	scx_rustland: show the CPU where the scheduler is running In the scheduler statistics reported periodically to stdout, instead of showing "pid=0" for the CPU where the scheduler is running (like an idle CPU), show "[self]". This helps to identify exactly where the user-space scheduler is running (when and where it migrates, etc.). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 17:03:30 +01:00
Andrea Righi	a7677fdf28	scx_rustland: bypass user-space scheduler for short-lived kthreads Bypass the user-space scheduler for kthreads that still have more than half of their runtime budget. As they are likely to release the CPU soon, granting them a substantial priority boost can enhance the overall system performance. In the event that one of these kthreads turns into a CPU hog, it will deplete its runtime budget and therefore it will be scheduled like any other normal task through the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:40:05 +01:00
Andrea Righi	405a11308e	scx_rustland: always use dispatch_on_cpu() when possible Use dispatch_on_cpu() when possible, so that all tasks dispatched by the user-space scheduler gets the same priority, instead of having some of them dispatched to the global DSQ and others dispatched to the per-CPU DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:08:31 +01:00
Andrea Righi	49f2e7ce06	scx_rustland: enable SCX_OPS_ENQ_LAST Make sure the scheduler is not activated if we are deadling with the last task running. This allows to consistency reduce scx_rustland CPU usage in systems that are mostly idle (and avoid unnecessary power consumption). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 16:06:45 +01:00
Tejun Heo	804180a74a	Merge pull request #58 from arighi/scx-rustland-improve-idle-cpu-assignment scx_rustland: prevent dispatching multiple tasks on the same idle cpu	2023-12-31 18:00:47 +09:00
Andrea Righi	0522219bea	scx_rustland: prevent dispatching multiple tasks on the same idle cpu When a task is dispatched we always try to pick the previously used CPU (if idle) to minimize the migration overhead. Alternatively, if such CPU is not available, we pick any other idle CPU in the system. However, we don't update the list of idle CPUs as we dispatch tasks, therefore we may end up sending multiple tasks to the same idle CPU (if their previously used CPU is the same) and we may even skip some idle CPUs completely. Change this logic to make sure that we never dispatch multiple tasks to the same idle CPU, by updating the list of idle CPUs as we send tasks to the BPF dispatcher. This also avoids dispatching tasks with a closely matched vruntime to the same CPU, thereby negating the advantages of the vruntime ordering. With this change in place, we ensure that tasks with a similar vruntime are dispatched to different CPUs, leading to significant improvements in latency performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-31 09:37:39 +01:00
Tejun Heo	641f9b76e9	Merge pull request #57 from arighi/scx-rustland-improve-cpu-selection scx_rustland: improve scheduler cpu selection	2023-12-30 21:56:48 +09:00
Andrea Righi	38145f8dc9	scx_rustland: check CPU selection validity When the scheduler decides to assign a different CPU to the task always make sure the assignment is valid according to the task cpumask. If it's not valid simply dispatch the task to the global DSQ. This prevents the scheduler from exiting with errors like this: 09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718] In the future we may want move this check directly into the user-space scheduler, but for now let's keep this check in the BPF dispatcher as a quick fix. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-30 10:40:46 +01:00
Andrea Righi	1a2c9f5fd4	scx_rustland: improve scheduler's idle CPU selection The current CPU selection logic in the scheduler presents some inefficiencies. When a task is drained from the BPF queue, the scheduler immediately checks whether the CPU previously assigned to the task is still idle, assigning it if it is. Otherwise, it iterates through available CPUs, always starting from CPU #0, and selects the first idle one without updating its state. This approach is consistently applied to the entire batch of tasks drained from the BPF queue, resulting in all of them being assigned to the same idle CPU (also with a higher likelihood of allocation to lower CPU ids rather than higher ones). While dispatching a batch of tasks to the same idle CPU is not necessarily problematic, a fairer distribution among the list of idle CPUs would be preferable. Therefore change the CPU selection logic to distribute tasks equally among the idle CPUs, still maintaining the preference for the previously used one. Additionally, apply the CPU selection logic just before tasks are dispatched, rather than assigning a CPU when tasks are drained from the BPF queue. This adjustment is important, because tasks may linger in the scheduler's internal structures for a bit and the idle state of the CPUs in the system may change during that period. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-30 10:34:08 +01:00
Tejun Heo	474a14970e	Merge pull request #56 from arighi/scx-rustland-reduce-scheduler-overhead scx_rustland: reduce scheduler overhead	2023-12-30 08:02:09 +09:00

1 2 3 4 5

202 Commits