JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-11-26 19:30:24 +00:00

Author	SHA1	Message	Date
Daniel Hodges	bab6e9523c	scx_rusty: Add mempolicy checks to rusty This change makes scx_rusty mempolicy aware. When a process uses set_mempolicy it can change NUMA memory preferences and cause performance issues when tasks are scheduled on remote NUMA nodes. This change modifies task_pick_domain to use the new helper method that returns the preferred node id. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 08:11:19 -07:00
Changwoo Min	971bb2e024	scx_lavd: pretty formatting for ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:54:15 +09:00
Changwoo Min	adfbf3934c	scx_lavd: tuning the max ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:52:23 +09:00
Changwoo Min	eff444516f	scx_lavd: directly measure service time for eligibility enforcement Estimating the service time from run time and frequency is not incorrect. However, it reacts slowly to sudden changes since it relies on the moving average. Hence, we directly measure the service time to enforce fairness. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:48:26 +09:00
I Hsin Cheng	1c3b563caf	scx_rusty: Pre-check task domain mask with pull domain mask Instead of performing domain mask checking inside "find_first_candidate()" every time, check whether the tasks within push domain are able to run on pull domain by performing the mask check at vector generation stage. This also avoids repeated computation generated by the same (task, pull_dom) pair as they'll try to check whether the pull domain is in the task domain mask. Also since whether a task is a kworker won't change over time, we can perform the check earlier and put it in the filter, too. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-16 21:48:06 +08:00
Tejun Heo	51334b5c4d	Bump versions for 1.0.1 release	2024-07-15 13:21:52 -10:00
Andrea Righi	8e7a526356	scx_bpfland: use nr_cpu_ids for consistency We always use nr_cpu_ids to represent the maximum CPU id returned by scx_bpf_nr_cpu_ids(). Replace cpu_max with nr_cpu_ids to be more consistent with the rest of the code. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 08:44:35 +02:00
Andrea Righi	33d06f653b	scx_bpfland: get rid of the MAX_CPUS hard-coded limit We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU DSQs, eliminating the need for the hard-coded limit MAX_CPUS. In this way scx_bpfland can support the same amount of CPUs that the kernel can handle. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:30 +02:00
Andrea Righi	b80ef7d8eb	scx_bpfland: optimize offline CPU handling Instead of constantly checking the need to drain tasks from the DSQs of the offline CPUs, provide an atomic flag to notify when there are tasks to be drained from the offline CPUs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:23 +02:00
Andrea Righi	0530706710	scx_bpfland: properly initialize the nvcsw metrics Initialize the number of voluntary context switches metrics in the local task storage. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:16:10 +02:00
Andrea Righi	bf4ad23599	scx_bpfland: refine interactive tasks flood safeguard Refine the safeguard mechanism to avoid generating too many interactive tasks in the system, which could nullify the effect of the interactive/regular task classification. The safeguard mechanism operates by pausing the promotion of new tasks to interactive status during the task wake-up process, whenever the number of interactive tasks in the priority queue exceeds a specific limit (set to 4x the number of online CPUs). Halting the promotion of additional interactive tasks allows to prioritize those already classified as interactive, thereby preventing potential "bursts" of excessive interactive tasks in the system. This refines the mitigation already provided by commit `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost"). Fixes: `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost") Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:11:34 +02:00
Andrea Righi	eb1cf0e670	scx_bpfland: improve task time slice evaluation Always assign the maximum time slice if there are idle CPUs in the system. Otherwise, double the task's unused time slice to reward tasks that use less CPU time and at the same time refill the time slice of the tasks every time they're dispatched. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-14 23:24:24 +02:00
Tejun Heo	3ae76acd12	Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0 Sync to upstream kernel and bump to 1.0	2024-07-14 07:00:38 -10:00
Changwoo Min	5b2112dd81	Merge pull request #421 from multics69/lavd-metrics scx_lavd: improve time slice and waker freq calculation	2024-07-14 18:49:36 +09:00
Tejun Heo	761ec142ce	Bump most versions to 1.0.0 sched_ext is about to be merged upstream. There are some compatibility breaking changes and we're making the current sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") the baseline. Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early development and is currently temporarily disabled, only the patchlevel is bumped.	2024-07-12 11:34:14 -10:00
Tejun Heo	f261d0f037	Sync from kernel - 1edab907b57d Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11 - cgroup support hasn't landed in the upstream kernel yet. This most likely will happen in a few weeks. For the time being, disable scx_flatcg, scx_pair and scx_mitosis. - Compat macro for DSQ task iterator dropped. This is now a part of the baseline. - scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being discussed. Dropped example usage from tools/sched_ext. None of the practical schedulers use it, so this should be fine for now. - scx_bpf_cpu_rq() added. - AUTOATTACH workaround for newer libbpf versions added.	2024-07-12 11:08:41 -10:00
Changwoo Min	512bd143a5	scx_lavd: count only related tasks in calculating waker_freq A task can become runnable in any task's context, not only its waker task's. Thus, we should not count wake-ups in an unrelated task's context. With this commit, the scheduler can (much more) accurately detect waker-wakee relationships. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:51:09 +09:00
Changwoo Min	95733f63ab	scx_lavd: calculate time slice as a function of run queue length The prior approach using the sum of weights gives too much penalty to nice tasks with large nice values. With this commit, the time slice is determined by the number of runnable tasks regardless of nice priority. Note that the fairness will still be enforced based on tasks' nice priorities (weights). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:45:22 +09:00
Changwoo Min	00fdc1d949	Merge pull request #417 from multics69/lavd-vdeadline scx_lavd: improve virtual deadline and current clock handling	2024-07-12 14:05:44 +09:00
Changwoo Min	d4bc92bea7	scx_lavd: print lat_cri to output Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 13:23:56 +09:00
Changwoo Min	4c5c564523	scx_lavd: initial current logical clock to zero To easily distinguish, let's initialize the current logical clock to zero (not the current physical time). Also, avoid the deadline calculation being zero by adding +1 here and there. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 10:15:54 +09:00
Andrea Righi	640bd562ff	scx_bpfland: prevent tasks from abusing interactive priority boost The priority boost for interactive tasks can be exploited to render the system nearly unresponsive by creating numerous tasks that constantly switch between wait/wakeup states. For example, stress tests like `hackbench -l 10000` can significantly degrade system responsiveness. To mitigate this, limit the number of interactive tasks added to the priority queue to 4x the number of online CPUs. This simple approach appears to be quite effective at identifying potential spam of "fake" interactive tasks, while still prioritizing legitimate interactive tasks. Additionally, periodically refresh the interactive status of the tasks based on their most recent average of voluntary context switches, preventing the interactive status from being too "sticky". Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:55 +02:00
Andrea Righi	1babb2b92d	scx_bpfland: prevent per-CPU kthreads starving other tasks Avoid dispatching per-CPU kthreads directly, since this may cause interactivity problems or unfairness, for example if there are too many softirqs being scheduled (e.g., in presence of high RX network traffic or when running certain stress tests, like hackbench). Moreover, in order to help with testing and benchmarks, introduce the option --local-kthread, that allows to restore the old behavior if enabled. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:48 +02:00
Andrea Righi	c3ebdd338f	scx_bpfland: prevent slice delta overflow When updating the task vruntime, ensure the time slice delta is always a positive value. Failing to do so may cause the global vruntime to increase excessively due to overflows. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	f59aa52fe7	scx_bpfland: expose the amount of online CPUs Periodically report the amount of online CPUs to stdout. The online CPUs are initially evaluated looking at the online cpumask, then the value is updated in the .cpu_offline() / .cpu_online() callbacks. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	3a47b484af	scx_bpfland: report interactive tasks to stdout Keep track of the CPUs that are running interactive tasks and report their amount to stdout. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	1a1a16b9e9	scx_bpfland: fix typo in slice_ns definition The correct default value of slice_ns is 5ms, not 5s. This change doesn't really make any difference in practice, since these values are changed by the Rust part when the scheduler is started, but it's good to keep this aligned to the proper values for consistency. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Changwoo Min	bdbfeb9fd1	scx_lavd: use logical current clock for virtual deadlines This commit changes the use of a physical clock to a virtual, logical clock in calculating deadlines. - The virtual current clock advances upon a task's running to its virtual deadline. - When enqueuing a task, its virtual deadline from the virtual current clock is calculated. With the above two changes, this guarantees that there is no such task whose virtual deadline is smaller than the virtual current clock. This means any enqueuing task can compete with any other already enqueued tasks. This allows a latency-critical task to be immediately scheduled if needed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 22:41:56 +09:00
Changwoo Min	408ea7892c	scx_lavd: induce sched_prio_to_latency_weight from slice weight So sched_prio_to_latency_weight is removed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:37:21 +09:00
Changwoo Min	bd964acff6	scx_lavd: deprioritize a newly forked task in latency Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:36:32 +09:00
Changwoo Min	48debe416e	scx_lavd: tuning the deadline equation under high load Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:54 +09:00
Changwoo Min	c72e063680	scx_lavd: do not use lat_prio_to_greedy_thresholds With other optimizations, lat_prio_to_greedy_thresholds is not effective any more. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:01 +09:00
Changwoo Min	9ed488798e	scx_lavd: use task's runtime to determine its deadline It has the effect of further preferring shorter jobs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:34:25 +09:00
Changwoo Min	e081b2a294	scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:33:56 +09:00
Andrea Righi	995577762a	scx_bpfland: refill task time slice Every time we need to dispatch a task re-evaluate its time slice as: (unused_time_slice + min_time_slice) / 2 This refills the time slice for tasks that haven't used much of their previously assigned time, improving fairness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	6a64182ef2	scx_bpfland: always classify interactive tasks Make sure to always classify interactive tasks, even when the system is not fully utilized. This ensures that if the system suddenly becomes overloaded, we already know which tasks need to be dispatched to the priority DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	8dd528abfd	scx_bpfland: pass enqueue flags when dispatching kthreads Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:10 +02:00
Andrea Righi	2bc8f800e7	scx_bpfland: report build id version Use the version string provided by scx_utils:build_id. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	bdb31e98e2	scx_bpfland: show statistics in a more human-readable format Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	f9d7844b2e	scx_bpfland: split direct dispatches and kthread dispatches Show separate statistics for direct dispatches and kthread direct dispatches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:27:59 +02:00
Andrea Righi	cfe2ed063d	scx_bpfland: time-based starvation prevention Tasks are consumed from various DSQs in the following order: per-CPU DSQs => priority DSQ => shared DSQ Tasks in the shared DSQ may be starved by those in the priority DSQ, which in turn may be starved by tasks dispatched to any per-CPU DSQ. To mitigate this, record the timestamp of the last task scheduling event both from the priority DSQ and the shared DSQ. If the starvation threshold is exceeded without consuming a task, the scheduler will be forced to consume a task from the corresponding DSQ. The starvation threshold can be adjusted using the --starvation-thresh command line parameter (default is 5ms). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:52:39 +02:00
Andrea Righi	9e0db4ae17	scx_bpfland: remove unnecessary RCU read protection There is no need to RCU protect the cpumask for the offline CPUs: it is created once when the scheduler is initialized and it's never deallocated. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	cef6ca93cf	scx_bpfland: adjust default time slice to 5ms Reduce the default time slice down to 5ms for a faster reaction and system responsiveness when the system is overcommissioned. This also helps to provide a more predictable level of performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	7d15e3171c	scx_bpfland: ensure task time slice never exceeds the slice_ns limit Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	e8a4d350ad	scx_bpfland: unify dispatching kthreads with direct CPU dispatches Always use direct CPU dispatch for kthreads, there is no need to treat kthreads in a special way, simply reuse direct CPU dispatch to prioritize them. Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(), since we may dispatch multiple tasks to the same per-CPU DSQ now. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 09:38:43 +02:00
Andrea Righi	d2231b0aed	scx_bpfland: drop built-in idle CPU selection logic Small refactoring of the idle CPU selection logic: - optimize idle CPU selection for tasks that can run on a single CPU - drop the built-in idle selection policy and completely rely on the custom one Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 08:54:17 +02:00
Andrea Righi	7c355f50b2	scx_bpfland: use the right cpumask to find any idle CPU We are incorrectly using the SMT idle cpumask to find any idle CPU, fix by using the generic idle cpumask. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-01 20:36:24 +02:00
Andrea Righi	c458f345b4	Merge pull request #408 from sched-ext/bpfland-cpu-hotplug scx_bpfland: support CPU hotplugging	2024-07-01 19:41:00 +02:00
Dan Schatzberg	32ac4b2cff	Merge pull request #389 from dschatzberg/mitosis mitosis: Update synchronization	2024-07-01 09:44:26 -04:00
Andrea Righi	ff7a518d28	scx_bpfland: support CPU hotplugging Implement CPU hotplugging in scx_bpfland without restarting the scheduler. The idle selection logic has been updated to consider online CPUs. Additionally, a cpumask for offline CPUs has been introduced. Tasks that have been dispatched to the DSQs associated with offline CPUs are consumed by the other CPUs that are still online. Moreover, the dependency on the Topology crate is temporarily dropped and instead, /sys/devices/system/cpu/smt/active is used to determine if SMT should be taken into account during idle selection. The Topology crate will be re-introduced later when scx_bpfland will gain more topology-aware capabilities. This fixes #406. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 23:04:13 +02:00
Andrea Righi	d76551bbd3	scx_rusty: fix stats map initialization The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can result in errors such as: Error: Failed to zero stat Caused by: number of values 6 != number of cpus 8 Fix by using num_possible_cpus() to initialize it. Fixes: `263e02f6` ("rusty: Use nr_cpu_ids instead of nr_cpus_possible") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 17:37:14 +02:00
Andrea Righi	74175f5a49	scx_bpfland: properly integrate with meson build Properly honor the meson build `serialize` option. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 21:37:00 +02:00
Andrea Righi	f98c35fd07	Merge pull request #388 from sched-ext/bpfland scheds: introduce scx_bpfland	2024-06-28 21:27:43 +02:00
Andrea Righi	cf4883fbf8	meson: introduce serialize build option With commit `5d20f89a` ("scheds-rust: build rust schedulers in sequence"), schedulers are now built serially one after the other to prevent meson and cargo from forking NxN parallel tasks. However, this change has made building a single scheduler much more cumbersome, due to the chain of dependencies. For example, building scx_rusty using the specific meson target would still result in all schedulers being built, because they all depend on each other. To address this issue, introduce the new meson build option `serialize=true|false` (default is false). This option allows disabling the schedulers' build chain, restoring the old behavior. With this option enabled, it is now possible to build just a single scheduler, parallelizing the cargo build properly, without triggering the build of the others. Example: $ meson setup build -Dbuildtype=release -Dserialize=false $ meson compile -C build scx_rusty Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 10:17:37 +02:00
Changwoo Min	24a238846e	scx_lavd: optimizing deadline related tunables The competition window was 7.5 msec, half of the targeted latency. However, it is too wide for some workloads, so unrelated tasks may compete with each other. Hence, it is tightened to about 1 msec with LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition. Also, when a system is overloaded, now the time space is stretched more aggressively (i.e., lat_prio^2) when a task's latency priority is low (high value). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-28 09:00:45 +09:00
Andrea Righi	7606b95150	scx_bpfland: introduce maximum time slice lag Introduce a tunable to set a limit of the minimum vruntime that is used when a task is dispatched, as: vtime_min = vtime_now - slice_lag_ns Increasing the time slice lag can make interactive tasks even more responsive at the cost of starving regular and newly created tasks. Default time slice lag is 0. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
Andrea Righi	5a44329d45	scheds: introduce scx_bpfland Overview ======== This scheduler is derived from scx_rustland, but it is fully implemented in BPF with a minimal user-space Rust part that processes command line options, collects metrics and logs scheduling statistics. Unlike scx_rustland, all scheduling decisions are made by the BPF component. Motivation ========== The primary goal of this scheduler is to act as a performance baseline for comparison with scx_rustland, allowing for a better assessment of the overhead caused by kernel/user-space interactions. It can also be used to deploy prototypes initially tested in the scx_rustland scheduler. In fact, this scheduler is expected to outperform scx_rustland, due to the elimination of the kernel/user-space overhead. Scheduling policy ================= scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes interactive workloads. Its scheduling policy closely mirrors scx_rustland, but it has been re-implemented in BPF with some small adjustments. Tasks are categorized as either interactive or regular based on their average rate of voluntary context switches per second: tasks that exceed a specific voluntary context switch threshold are classified as interactive. Interactive tasks are prioritized in a higher-priority DSQ, while regular tasks are placed in a lower-priority DSQ. Within each queue, tasks are sorted based on their weighted runtime, using the built-in scx vtime ordering capabilities (scx_bpf_dispatch_vtime()). Moreover, each task gets a time slice budget. When a task is dispatched, it receives a time slice equivalent to the remaining unused portion of its previously allocated time slice (with a minimum threshold applied). This gives latency-sensitive workloads more chances to exceed their time slice when needed to perform short bursts of CPU activity without being interrupted (e.g., real-time audio encoding / decoding workloads). 
Results ======= Initial test results indicate that this scheduler offers around a +5% improvement in frames-per-second (fps) compared to scx_rustland when using the benchmark "playing a video game while recompiling the kernel". This improvement was observed in games such as Cyberpunk 2077, Counter-Strike 2, and Baldur's Gate 3. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
Changwoo Min	f86d564d89	scx_lavd: fast path for ops.dispatch() when fully loaded When the system is fully loaded, so that all CPUs are in use, skip checking the cpumask. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-27 18:00:39 +09:00
David Vernet	fe3ce64a9b	Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load"	2024-06-26 17:35:22 -04:00
Changwoo Min	abc6e31fef	scx_lavd: for a forked task, inherit its parent's statistics The old approach was too conservative in running a new task, so when a fork-heavy workload competes with a CPU-bound workload, the fork-heavy one is starved. The new approach solves the starvation problem by inheriting parent's statistics. It seems a good (at least better than old) guess how a new task will behave. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 19:00:10 +09:00
Changwoo Min	ac9c49f5b5	scx_lavd: loosen the deadline when overloaded When the system is highly loaded with compute-intensive tasks, the old setting chokes latency-sensitive tasks, so loosen the deadline when the system is overloaded (> 100% utilization). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 15:06:31 +09:00
Changwoo Min	b32734168b	scx_lavd: print build ID when lavd is loaded When the lavd is loaded, it prints out its build id. It helps to easily identify what version it is when testing. ``` 01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu) ``` Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 10:57:19 +09:00
Dan Schatzberg	d349f86d04	mitosis: Update synchronization The synchronization for mitosis is a bit ad-hoc, working around lack of atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and compiler barriers to get the behaviors we want. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-06-25 12:44:16 -07:00
David Vernet	d42bae4fcf	rusty: Print build ID when rusty is loaded When someone is testing schedulers, we often have to ask what version the scheduler is running as. Now that we can access the build ID from rust schedulers, let's update scx_rusty to print the build ID when rusty first starts running. This results in output such as the following: ``` [void@maniforge scx]$ rusty 19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug) 19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111 19:04:26 [INFO] DOM[00] mask= 0b00000000111111110000000011111111 19:04:26 [INFO] DOM[01] mask= 0b11111111000000001111111100000000 19:04:26 [INFO] Rusty scheduler started! ``` Signed-off-by: David Vernet <void@manifault.com>	2024-06-25 11:44:46 -05:00
David Vernet	9d9ece11aa	Merge pull request #384 from jfernandez/log-recorder scx_utils: Add log_recorder module for metrics-rs	2024-06-25 11:43:37 -05:00
Changwoo Min	5d0db5c5fe	scx_lavd: revising tunables to reduce micro-stutters This is a second attempt to optimize tunables for a wider range of games. 1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range). Now the latency priority (biased by nice value) will decide which task should run first. The nice value will decide the time slice. 2) The first change will give higher priority to latency-critical tasks compared to before. For compensation, the slice boost also increased (2x -> 3x). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-25 16:13:32 +09:00
Jose Fernandez	e5984ed016	scx_utils: Add log_recorder module for metrics-rs This change adds a new module to the scx_utils crate that provides a log recorder for metrics-rs. The log recorder will log all metrics to the console at a configurable interval in an easy to read format. Each metric type will be displayed in a separate section. Indentation will be used to show the hierarchy of the metrics. This results in a more verbose output, but it is easier to read and understand. scx_rusty was updated to use the log recorder and all explicit metric logging was removed. Counters will show the total count and the rate of change per second. Counters with an additional label, like `type` in `dispatched_tasks_total` in rusty, will show the count, rate, and percentage of the total count. Counters: dispatched_tasks_total: 65559 [1344.8/s] prev_idle: 44963 (68.6%) [966.5/s] wsync_prev_idle: 15696 (23.9%) [317.3/s] direct_dispatch: 2833 (4.3%) [35.3/s] dsq: 1804 (2.8%) [21.3/s] wsync: 262 (0.4%) [4.3/s] direct_greedy: 1 (0.0%) [0.0/s] pinned: 0 (0.0%) [0.0/s] greedy_idle: 0 (0.0%) [0.0/s] greedy_xnuma: 0 (0.0%) [0.0/s] direct_greedy_far: 0 (0.0%) [0.0/s] greedy_local: 0 (0.0%) [0.0/s] dl_clamped_total: 1290 [20.3/s] dl_preset_total: 514 [1.0/s] kick_greedy_total: 6 [0.3/s] lb_data_errors_total: 0 [0.0/s] load_balance_total: 0 [0.0/s] repatriate_total: 0 [0.0/s] task_errors_total: 0 [0.0/s] Gauges will show the last set value: Gauges: slice_length_us: 20000.00 Histograms will show the average, min, and max. The histogram will be reset after each log interval to avoid memory leaks, since the data structure that holds the samples is unbounded. Histograms: cpu_busy_pct: avg=1.66 min=1.16 max=2.16 load_avg node=0: avg=0.31 min=0.23 max=0.39 load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39 processing_duration_us: avg=297.50 min=296.00 max=299.00 Signed-off-by: Jose Fernandez <josef@netflix.com>	2024-06-24 18:45:02 -06:00
David Vernet	8059acb634	Merge pull request #381 from vax-r/rusty_dom_load_status_check scx_rusty: Pull domain status check	2024-06-24 17:54:54 -05:00
David Vernet	55ee210d42	Merge pull request #382 from vax-r/rusty_refactor scx_rusty: Refactor ridx assignment in populate_tasks_by_load	2024-06-24 17:47:55 -05:00
Changwoo Min	016229cbcf	scx_lavd: revising tunables for less-preemptive games In some games (e.g., Elden Ring), it was observed that preemption happens much less frequently. The reason is that tasks' runtime per schedule is similar, so it does not meet the existing criteria. To alleviate the problem, the following three tunables are revised: 1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to trigger more preemption. 2) Smaller LAVD_SLICE_MAX_NS works better, especially on 250 or 300Hz kernels. 3) Longer LAVD_ELIGIBLE_TIME_MAX perturbs time lines less frequently. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-24 00:27:33 +09:00
I Hsin Cheng	eab234a74f	scx_rusty: Refactor ridx assignment in populate_tasks_by_load The original assignment of the variable ridx is equivalent to comparing between "ridx" and "wids - MAX_PIDS". Use the u64 max library helper function to perform the comparison and improve readability. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:58:51 +08:00
I Hsin Cheng	84b9ac4dce	scx_rusty: Pull domain status check Check whether the BalanceState of pull_dom.load inside function try_find_move_task is actually the variant NeedsPull. It'll perform task migration in a bit more conservative manner when the system is under high load. Experiments were performed while the system was compiling the Linux kernel and undergoing a large amount of I/O operations at the same time using fio. The results show that before the modification, there were 12,6617 task migrations system wide. After the modification, there were 11,5419 task migrations system wide. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:38:23 +08:00
David Vernet	5038f54701	Merge pull request #377 from jfernandez/metrics-rs rusty: Integrate stats with the metrics framework	2024-06-21 15:23:20 -05:00
David Vernet	3bd15be840	rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible() In scx_rlfifo, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. To properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that we don't accidentally dispatch to an invalid DSQ. Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:20 -05:00
David Vernet	263e02f644	rusty: Use nr_cpu_ids instead of nr_cpus_possible In scx_rusty, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. scx_rusty already accounts for offlined CPUs, so to properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids(). Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:19 -05:00
David Vernet	bdbf4b9c05	topo: Return nr_cpu_ids from host Topology In some cases, a host may have an odd topology where there are gaps in CPU IDs (including between possible CPUs). A common pattern in schedulers is to perform allocations for every possible CPU ID, such as creating a per-cpu DSQ. In order to avoid confusing schedulers, let's track the maximum CPU ID on a system so that we can return the number of CPU IDs on the system which is inclusive of gaps. We also update scx_rustland in this change to accommodate the fact that we no longer export nr_cpus_possible() from TopologyMap. Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:13 -05:00
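As a rough illustration of the difference between counting possible CPUs and counting CPU IDs, here is a minimal sketch. The function name and the slice-based input are assumptions made for the example; the Topology code itself may derive the value differently.

```rust
/// Number of CPU IDs, inclusive of gaps: one past the largest possible ID.
/// Returns 0 for an empty list.
fn nr_cpu_ids(possible_cpus: &[usize]) -> usize {
    possible_cpus
        .iter()
        .max()
        .map_or(0, |&max_id| max_id + 1)
}
```

For a host whose possible CPUs are 0, 1, 4 and 5, the count of possible CPUs is 4 but the number of CPU IDs is 6, so a per-CPU array sized by the former would be too small to index by CPU ID.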
Jose Fernandez	83373b1f4e	rusty: Integrate stats with the metrics framework We need a layer of indirection between the stats collection and their output destinations. Currently, stats are only printed to stdout. Our goal is to integrate with various telemetry systems such as Prometheus, StatsD, and custom metric backends like those used by Meta and Netflix. Importantly, adding a new backend should not require changes to the existing stats code. This patch introduces the `metrics` [1] crate, which provides a framework for defining metrics and publishing them to different backends. The initial implementation includes the `dispatched_tasks_count` metric, tagged with `type`. This metric increments every time a task is dispatched, emitting the raw count instead of a percentage. A monotonic counter is the most suitable metric type for this use case, as percentages can be calculated at query time if needed. Existing logged metrics continue to print percentages and remain unchanged. A new flag, `--enable-prometheus`, has been added. When enabled, it starts a Prometheus endpoint on port 9000 (default is false). This endpoint allows metrics to be charted in Prometheus or Grafana dashboards. Future changes will migrate additional stats to this framework and add support for other backends. [1] https://metrics.rs/ Signed-off-by: Jose Fernandez <josef@netflix.com>	2024-06-21 10:18:44 -06:00
Changwoo Min	9c21ace276	Merge pull request #373 from vax-r/lavd_reuse scx_lavd: Reuse can_task1_kick_task2	2024-06-19 15:29:05 +09:00
I Hsin Cheng	99960ad960	scx_lavd: Reuse can_task1_kick_task2 Use the function can_task1_kick_task2() in the places that also check comp_preemption_info between two CPUs, for better consistency. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-19 11:01:31 +08:00
Changwoo Min	691869e83f	Merge pull request #369 from sched-ext/lavd-fix-pick-cpu scx_lavd: properly check for idle CPUs in pick_cpu()	2024-06-19 09:23:17 +09:00
Changwoo Min	dad25f1b5d	Merge pull request #368 from multics69/lavd-perf-misc scx_lavd: misc performance tuning and code clean up	2024-06-19 07:26:52 +09:00
Andrea Righi	bad9ed13ef	scx_lavd: properly check for idle CPUs in pick_cpu() It seems that we are not updating `is_idle` when we find an idle CPU with pick_cpu(), causing unnecessary rescheduling events when select_cpu() is called. To resolve this, ensure that the is_idle state is correctly set. Additionally, always ensure that the task is dispatched to the local DSQ immediately upon finding (and reserving) an idle CPU. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-18 17:36:39 +02:00
Changwoo Min	632fa9e4f2	scx_lavd: misc code clean up - clean up u64 and u32 usages in structures to reduce struct size - refactor pick_cpu() for readability Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-18 18:11:49 +09:00
Changwoo Min	5165bf5a03	scx_lavd: tuning CPU frequency scaling The required CPU performance (cpuperf) was set to 1024 (100%) only when the CPU utilization was 100%. When a sudden load spike happens, this makes the system adapt slowly in the next interval. The new scheme always reserves some headroom in advance, so it sets cpuperf to 1024 when the CPU utilization reaches 85%. This gives some room to adapt in advance. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-18 18:11:49 +09:00
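The headroom idea can be sketched as below. The commit only states that cpuperf reaches 1024 at 85% utilization; the linear scaling below that point, the function name, and the constant names are assumptions for illustration.

```rust
/// Maximum cpuperf value (100%), as stated in the commit message.
const CPUPERF_MAX: u64 = 1024;
/// Utilization (in percent) at which full performance is requested.
const FULL_PERF_UTIL_PCT: u64 = 85;

/// Map CPU utilization in percent to a cpuperf target.
/// Assumes linear scaling up to the headroom threshold, then clamps.
fn cpuperf_target(util_pct: u64) -> u64 {
    let scaled = util_pct * CPUPERF_MAX / FULL_PERF_UTIL_PCT;
    scaled.min(CPUPERF_MAX)
}
```

With this mapping, any utilization at or above 85% already requests the maximum, leaving the remaining 15% as headroom for a load spike.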
I Hsin Cheng	94e3616c02	scx_rusty: Refactor lookup operation for new_domc in task_set_domain Modify the execution sequence before the lookup operation for new_domc. If new_dom_id == NO_DOM_FOUND, the lookup for new_domc is certain to fail, so there is no need to wait until new_domc is found to be NULL; the cpumask should be cleared and the function should return directly in that case. We should also avoid using try_lookup_dom_ctx outside the context of lookup_dom_ctx, which keeps the interface consistent. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-18 12:58:17 +08:00
David Vernet	0184444285	Merge pull request #366 from sched-ext/task_set_domain_global rusty: Make dom_xfer_task() a global prog	2024-06-17 14:43:45 -05:00
David Vernet	dfe0ffb312	Merge pull request #347 from sched-ext/rusty_cleanup rusty: Clean up some logic in rusty	2024-06-17 14:26:53 -05:00
David Vernet	7985ee556e	rusty: Clean up dispatch logic The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up so that we're just comparing dom ids rather than iterating over arrays nested inside of pcpu context. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:24:30 -05:00
David Vernet	87aa86845d	rusty: Refactor + slightly improve wake_sync Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to see if the waker CPU's runqueue is empty, and then migrate the wakee there if so. We'll want to expand this to be more thorough, such as: - Checking to see if prev_cpu and waker_cpu share the same LLC when determining where to migrate - Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores - ... Right now all of that code is just a big blob in the middle of rusty_select_cpu(). Let's pull it into its own function to improve readability, and also add some logic to stay on prev_cpu if it shares an LLC with the waker. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:24:29 -05:00
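A minimal sketch of the decision described in the commit above, under stated assumptions: the `CpuInfo` struct, its fields, and the function name are hypothetical, and the real logic runs in BPF inside rusty and may check more conditions than shown here.

```rust
/// Hypothetical per-CPU snapshot used only for this illustration.
struct CpuInfo {
    id: u32,
    llc_id: u32,
    nr_queued: u32,
}

/// Pick a CPU for a synchronous wakeup.
/// - Stay on prev_cpu when it shares an LLC with the waker.
/// - Otherwise migrate to the waker's CPU if its runqueue is empty.
/// - Otherwise return None and let the normal selection path decide.
fn wake_sync_target(prev: &CpuInfo, waker: &CpuInfo) -> Option<u32> {
    if prev.llc_id == waker.llc_id {
        return Some(prev.id);
    }
    if waker.nr_queued == 0 {
        return Some(waker.id);
    }
    None
}
```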
David Vernet	fed66fa571	rusty: Make dom_xfer_task() a global prog It seems that task_set_domain() is nearly at the point where it can cause the verifier to get confused and think that it's exceeding the number of available instructions per program. I've seen this a number of times when making small changes to task_set_domain(), and it happened once again when @vax-r (I-Hsin Cheng) made a small cleanup change to rusty in https://github.com/sched-ext/scx/pull/362. To avoid this, let's just make dom_xfer_task() a separate global program so that the verifier doesn't have to worry about branch pruning, etc. depending on what the caller does. This should hopefully make task_set_domain() (and its callers) much less brittle. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:22:26 -05:00
Tejun Heo	dde2942125	compat: Drop __COMPAT_scx_bpf_cpuperf_() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_(). The open helper macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.	2024-06-16 06:16:53 -10:00
Tejun Heo	13e8388e1e	compat: Drop __COMPAT_HAS_CPUMASKS In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros now check the existence of scx_bpf_nr_cpu_ids() and abort if not.	2024-06-16 06:12:06 -10:00
Tejun Heo	5b5e5be906	compat: Drop __COMPAT_SCX_KICK_IDLE In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros now check the existence of SCX_KICK_IDLE and abort if not.	2024-06-15 20:24:15 -10:00
Tejun Heo	7c9aedaefe	compat: Drop __COMPAT_scx_bpf_switch_all() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_all(). The open helper macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.	2024-06-15 20:03:37 -10:00
Tejun Heo	dd6255a601	Merge pull request #359 from sched-ext/htejun/cosmetic common.bpf.h: Cosmetic changes	2024-06-15 06:42:00 -10:00
Andrea Righi	cb20a6f136	scx_rlfifo: dispatch all tasks on the first CPU available With commit `786ec0c0` ("scx_rlfifo: schedule all tasks in user-space") all the scheduling decisions are now happening in user-space. This also bypasses the built-in idle selection logic, delegating the CPU selection for each task to the user-space scheduler. The easiest way to distribute tasks across the available CPUs is to simply dispatch each one on the first CPU available. In this way the scheduler becomes usable in practical scenarios while maintaining its simplicity, and tasks are spread across all the available CPUs. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:13:53 +02:00
Andrea Righi	786ec0c04a	scx_rlfifo: schedule all tasks in user-space Disable all the BPF optimization shortcuts by default and force all tasks to be processed by the user-space scheduler. Given that the primary goal of this scheduler is to offer a straightforward and intuitive example for experimental purposes, this change simplifies the process for individuals looking to experiment, allowing them to apply changes to user-space code and quickly observe the effects, without dealing with any in-kernel optimizations. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:07:39 +02:00
Andrea Righi	59f47d6659	scx_rlfifo: improve code readability No functional change, just add some comments to better describe the parameters used when initializing the main BpfScheduler object. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:05:28 +02:00
Tejun Heo	d7677e3e5c	scx/common.bpf.h: Rename bpf_log2[l]() to u32/64_log2() The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and bpf_log2l() to u64_log2(). While at it, relocate them below compiler directive helpers.	2024-06-14 15:22:39 -10:00
Andrea Righi	8c6fe540eb	scx_rustland: prevent excessive starvation when system is congested Keep track of the maximum vruntime among all tasks and flush them if the difference between the maximum and minimum vruntime exceeds slice_ns. This helps to prevent excessive starvation, as every task is guaranteed to be dispatched within the slice_ns time limit. Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-14 20:09:19 +02:00
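The flush condition described in the commit above can be sketched roughly as follows. The function name and the slice-of-vruntimes input are assumptions for the example; scx_rustland tracks these values inside its own task pool rather than rescanning a list.

```rust
/// Return true when the spread between the largest and smallest queued
/// vruntime exceeds one time slice, meaning the queued tasks should be
/// flushed (dispatched) to bound starvation. An empty queue never flushes.
fn should_flush(vruntimes: &[u64], slice_ns: u64) -> bool {
    match (vruntimes.iter().min(), vruntimes.iter().max()) {
        (Some(&min), Some(&max)) => max - min > slice_ns,
        _ => false,
    }
}
```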


518 Commits