JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-12-02 05:47:12 +00:00

Author	SHA1	Message	Date
Changwoo Min	c72e063680	scx_lavd: do not use lat_prio_to_greedy_thresholds With other optimizations, lat_prio_to_greedy_thresholds is not effective any more. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:01 +09:00
Changwoo Min	9ed488798e	scx_lavd: use task's runtime to determine its deadline This has the effect of further preferring shorter jobs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:34:25 +09:00
Changwoo Min	e081b2a294	scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:33:56 +09:00
Andrea Righi	995577762a	scx_bpfland: refill task time slice Every time we need to dispatch a task, re-evaluate its time slice as: (unused_time_slice + min_time_slice) / 2 This refills the time slice for tasks that haven't used much of their previously assigned time, improving fairness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
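The refill rule in commit 995577762a can be sketched as plain arithmetic. This is a Python model for illustration only, not the BPF code: the function name and the use of integer division are assumptions.

```python
def refill_time_slice(unused_time_slice_ns: int, min_time_slice_ns: int) -> int:
    """Model of the refill rule quoted in the commit message.

    The new slice is the average of the unused part of the previous slice
    and the minimum slice. Integer division is an assumption here.
    """
    return (unused_time_slice_ns + min_time_slice_ns) // 2


# A task that used its whole slice gets half of the minimum slice,
# while a task that left 4 ms unused gets a larger refill.
exhausted = refill_time_slice(0, 1_000_000)
mostly_unused = refill_time_slice(4_000_000, 1_000_000)
```

Under this model, tasks that yield early are handed back a larger budget than tasks that burn their whole slice, which is the fairness effect the commit describes.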
Andrea Righi	6a64182ef2	scx_bpfland: always classify interactive tasks Make sure to always classify interactive tasks, even when the system is not fully utilized. This ensures that if the system suddenly becomes overloaded, we already know which tasks need to be dispatched to the priority DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	8dd528abfd	scx_bpfland: pass enqueue flags when dispatching kthreads Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:10 +02:00
Andrea Righi	fc0d1bd003	Merge pull request #415 from sched-ext/bpfland-output scx_bpfland: additional stats and output improvements	2024-07-05 19:50:07 +02:00
Tejun Heo	af5e89e73c	Merge pull request #412 from vax-r/flatcg_delta_fetch scx_flatcg: Make good use of __sync_fetch_and_sub()	2024-07-05 07:39:22 -10:00
Tejun Heo	14d0a0ef64	Merge pull request #411 from vax-r/Fix_typo scx_flatcg: Fix_typo	2024-07-05 07:35:31 -10:00
Andrea Righi	2bc8f800e7	scx_bpfland: report build id version Use the version string provided by scx_utils:build_id. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	bdb31e98e2	scx_bpfland: show statistics in a more human-readable format Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	f9d7844b2e	scx_bpfland: split direct dispatches and kthread dispatches Show separate statistics for direct dispatches and kthread direct dispatches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:27:59 +02:00
I Hsin Cheng	aae826b1b3	scx_flatcg: Make good use of __sync_fetch_and_sub() Fetch the value of "delta" directly from the value returned by __sync_fetch_and_sub(), as it returns the original value of cgc->cvtime_delta. An additional fetch of cgc->cvtime_delta would be redundant here. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-05 01:03:20 +08:00
I Hsin Cheng	3e52761487	scx_flatcg: Fix_typo Fix "oppotunistic" to "opportunistic". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-04 22:04:40 +08:00
Andrea Righi	cfe2ed063d	scx_bpfland: time-based starvation prevention Tasks are consumed from various DSQs in the following order: per-CPU DSQs => priority DSQ => shared DSQ Tasks in the shared DSQ may be starved by those in the priority DSQ, which in turn may be starved by tasks dispatched to any per-CPU DSQ. To mitigate this, record the timestamp of the last task scheduling event both from the priority DSQ and the shared DSQ. If the starvation threshold is exceeded without consuming a task, the scheduler will be forced to consume a task from the corresponding DSQ. The starvation threshold can be adjusted using the --starvation-thresh command line parameter (default is 5ms). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:52:39 +02:00
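The consumption order and the starvation override described in commit cfe2ed063d can be modelled as follows. This is a minimal Python sketch under stated assumptions: the queue names, the dictionary layout, the strict `>` comparison, and checking the priority DSQ before the shared DSQ when both are starved are all illustrative choices, not taken from the scheduler source.

```python
STARVATION_THRESH_NS = 5_000_000  # 5 ms, the default stated in the commit message


def pick_dsq(now_ns, queued, last_run_ns, thresh_ns=STARVATION_THRESH_NS):
    """Return the name of the DSQ to consume from, or None if all are empty.

    queued:      number of tasks waiting in each DSQ (hypothetical layout)
    last_run_ns: timestamp of the last task consumed from each tracked DSQ
    """
    # Starvation override: a non-empty DSQ that has waited longer than
    # the threshold is served first, regardless of the normal order.
    for name in ("priority", "shared"):
        if queued[name] > 0 and now_ns - last_run_ns[name] > thresh_ns:
            return name
    # Normal order: per-CPU DSQs => priority DSQ => shared DSQ.
    for name in ("per_cpu", "priority", "shared"):
        if queued[name] > 0:
            return name
    return None


queued = {"per_cpu": 3, "priority": 0, "shared": 1}
last_run = {"priority": 0, "shared": 0}
early = pick_dsq(1_000_000, queued, last_run)  # shared DSQ not yet starved
late = pick_dsq(6_000_000, queued, last_run)   # shared DSQ waited more than 5 ms
```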
Andrea Righi	9e0db4ae17	scx_bpfland: remove unnecessary RCU read protection There is no need to RCU protect the cpumask for the offline CPUs: it is created once when the scheduler is initialized and it's never deallocated. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	cef6ca93cf	scx_bpfland: adjust default time slice to 5ms Reduce the default time slice down to 5ms for a faster reaction and system responsiveness when the system is overcommissioned. This also helps to provide a more predictable level of performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	7d15e3171c	scx_bpfland: ensure task time slice never exceeds the slice_ns limit Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	e8a4d350ad	scx_bpfland: unify dispatching kthreads with direct CPU dispatches Always use direct CPU dispatch for kthreads, there is no need to treat kthreads in a special way, simply reuse direct CPU dispatch to prioritize them. Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(), since we may dispatch multiple tasks to the same per-CPU DSQ now. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 09:38:43 +02:00
Andrea Righi	d2231b0aed	scx_bpfland: drop built-in idle CPU selection logic Small refactoring of the idle CPU selection logic: - optimize idle CPU selection for tasks that can run on a single CPU - drop the built-in idle selection policy and completely rely on the custom one Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 08:54:17 +02:00
Andrea Righi	7c355f50b2	scx_bpfland: use the right cpumask to find any idle CPU We are incorrectly using the SMT idle cpumask to find any idle CPU, fix by using the generic idle cpumask. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-01 20:36:24 +02:00
Andrea Righi	c458f345b4	Merge pull request #408 from sched-ext/bpfland-cpu-hotplug scx_bpfland: support CPU hotplugging	2024-07-01 19:41:00 +02:00
Dan Schatzberg	32ac4b2cff	Merge pull request #389 from dschatzberg/mitosis mitosis: Update synchronization	2024-07-01 09:44:26 -04:00
Andrea Righi	ff7a518d28	scx_bpfland: support CPU hotplugging Implement CPU hotplugging in scx_bpfland without restarting the scheduler. The idle selection logic has been updated to consider online CPUs. Additionally, a cpumask for offline CPUs has been introduced. Tasks that have been dispatched to the DSQs associated with offline CPUs are consumed by the other CPUs that are still online. Moreover, the dependency on the Topology crate is temporarily dropped and instead, /sys/devices/system/cpu/smt/active is used to determine if SMT should be taken into account during idle selection. The Topology crate will be re-introduced later when scx_bpfland will gain more topology-aware capabilities. This fixes #406. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 23:04:13 +02:00
Andrea Righi	d76551bbd3	scx_rusty: fix stats map initialization The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can result in errors such as: Error: Failed to zero stat Caused by: number of values 6 != number of cpus 8 Fix by using num_possible_cpus() to initialize it. Fixes: `263e02f6` ("rusty: Use nr_cpu_ids instead of nr_cpus_possible") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 17:37:14 +02:00
Andrea Righi	74175f5a49	scx_bpfland: properly integrate with meson build Properly honor the meson build `serialize` option. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 21:37:00 +02:00
Andrea Righi	f98c35fd07	Merge pull request #388 from sched-ext/bpfland scheds: introduce scx_bpfland	2024-06-28 21:27:43 +02:00
Andrea Righi	cf4883fbf8	meson: introduce serialize build option With commit `5d20f89a` ("scheds-rust: build rust schedulers in sequence"), schedulers are now built serially one after the other to prevent meson and cargo from forking NxN parallel tasks. However, this change has made building a single scheduler much more cumbersome, due to the chain of dependencies. For example, building scx_rusty using the specific meson target would still result in all schedulers being built, because they all depend on each other. To address this issue, introduce the new meson build option `serialize=true|false` (default is false). This option allows to disable the schedulers' build chain, restoring the old behavior. With this option enabled, it is now possible to build just a single scheduler, parallelizing the cargo build properly, without triggering the build of the others. Example: $ meson setup build -Dbuildtype=release -Dserialize=false $ meson compile -C build scx_rusty Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 10:17:37 +02:00
Changwoo Min	24a238846e	scx_lavd: optimizing deadline related tunables The competition window was 7.5 msec, half of the targeted latency. However, it is too wide for some workloads, so unrelated tasks may compete with each other. Hence, it is tightened to about 1 msec with LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition. Also, when a system is overloaded, now the time space is stretched more aggressively (i.e., lat_prio^2) when a task's latency priority is low (high value). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-28 09:00:45 +09:00
Andrea Righi	7606b95150	scx_bpfland: introduce maximum time slice lag Introduce a tunable to set a limit of the minimum vruntime that is used when a task is dispatched, as: vtime_min = vtime_now - slice_lag_ns Increasing the time slice lag can make interactive tasks even more responsive at the cost of starving regular and newly created tasks. Default time slice lag is 0. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
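The lower bound introduced in commit 7606b95150 can be illustrated with a small Python model. Only the formula `vtime_min = vtime_now - slice_lag_ns` comes from the commit message; the function name and the use of `max()` to apply the limit are assumptions.

```python
def clamp_task_vtime(task_vtime: int, vtime_now: int, slice_lag_ns: int = 0) -> int:
    """Limit how far behind the global vruntime a task may be at dispatch.

    slice_lag_ns defaults to 0, matching the default stated in the commit.
    """
    vtime_min = vtime_now - slice_lag_ns
    # Assumption: the limit is applied by raising the task's vruntime
    # to vtime_min whenever it has fallen further behind than that.
    return max(task_vtime, vtime_min)


no_lag = clamp_task_vtime(100, 1_000)           # clamped to vtime_now
with_lag = clamp_task_vtime(100, 1_000, 300)    # allowed to lag by 300
```

A larger lag lets a task that slept for a while keep more of its accumulated credit, which is why the commit warns that it favours interactive tasks at the expense of regular and newly created ones.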
Andrea Righi	5a44329d45	scheds: introduce scx_bpfland Overview ======== This scheduler is derived from scx_rustland, but it is fully implemented in BPF, with a minimal user-space Rust part that processes command line options, collects metrics and logs scheduling statistics. Unlike scx_rustland, all scheduling decisions are made by the BPF component. Motivation ========== The primary goal of this scheduler is to act as a performance baseline for comparison with scx_rustland, allowing for a better assessment of the overhead caused by kernel/user-space interactions. It can also be used to deploy prototypes initially tested in the scx_rustland scheduler. In fact, this scheduler is expected to outperform scx_rustland, due to the elimination of the kernel/user-space overhead. Scheduling policy ================= scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes interactive workloads. Its scheduling policy closely mirrors scx_rustland, but it has been re-implemented in BPF with some small adjustments. Tasks are categorized as either interactive or regular based on their average rate of voluntary context switches per second: tasks that exceed a specific voluntary context switch threshold are classified as interactive. Interactive tasks are prioritized in a higher-priority DSQ, while regular tasks are placed in a lower-priority DSQ. Within each queue, tasks are sorted based on their weighted runtime, using the built-in scx vtime ordering capabilities (scx_bpf_dispatch_vtime()). Moreover, each task gets a time slice budget. When a task is dispatched, it receives a time slice equivalent to the remaining unused portion of its previously allocated time slice (with a minimum threshold applied). This gives latency-sensitive workloads more chances to exceed their time slice when needed to perform short bursts of CPU activity without being interrupted (e.g., real-time audio encoding / decoding workloads). Results ======= Initial test results indicate that this scheduler offers around a +5% improvement in frames-per-second (fps) compared to scx_rustland when using the benchmark "playing a video game while recompiling the kernel". This improvement was observed in games such as Cyberpunk 2077, Counter-Strike 2, and Baldur's Gate 3. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
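The interactive/regular split described in commit 5a44329d45 can be sketched in Python. This is a simplified model: it uses a single measurement window instead of a running average, and the function name, parameter names, and threshold value are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def is_interactive(nvcsw: int, elapsed_ns: int, nvcsw_thresh: int) -> bool:
    """Classify a task from its voluntary context switches in one window.

    nvcsw:        voluntary context switches counted in the window
    elapsed_ns:   length of the window in nanoseconds
    nvcsw_thresh: switches per second above which a task is interactive
                  (hypothetical tunable, not a value from the source)
    """
    if elapsed_ns <= 0:
        return False
    rate_per_sec = nvcsw * NSEC_PER_SEC // elapsed_ns
    return rate_per_sec > nvcsw_thresh


# With a hypothetical threshold of 10 switches per second:
sleepy_task = is_interactive(50, NSEC_PER_SEC, 10)  # 50/s, goes to the priority DSQ
cpu_hog = is_interactive(5, NSEC_PER_SEC, 10)       # 5/s, goes to the regular DSQ
```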
Changwoo Min	f86d564d89	scx_lavd: fast path for ops.dispatch() when fully loaded When the system is fully loaded so that all CPUs are in use, skip checking the cpumask. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-27 18:00:39 +09:00
David Vernet	fe3ce64a9b	Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load"	2024-06-26 17:35:22 -04:00
Changwoo Min	abc6e31fef	scx_lavd: for a forked task, inherit its parent's statistics The old approach was too conservative in running a new task, so when a fork-heavy workload competes with a CPU-bound workload, the fork-heavy one is starved. The new approach solves the starvation problem by inheriting parent's statistics. It seems a good (at least better than old) guess how a new task will behave. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 19:00:10 +09:00
Changwoo Min	ac9c49f5b5	scx_lavd: loosen the deadline when overloaded When the system is highly loaded with compute-intensive tasks, the old setting chokes latency-sensitive tasks, so loosen the deadline when the system is overloaded (> 100% utilization). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 15:06:31 +09:00
Changwoo Min	b32734168b	scx_lavd: print build ID when lavd is loaded When the lavd is loaded, it prints out its build id. It helps to easily identify what version it is when testing. ``` 01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu) ``` Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 10:57:19 +09:00
Dan Schatzberg	d349f86d04	mitosis: Update synchronization The synchronization for mitosis is a bit ad-hoc, working around lack of atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and compiler barriers to get the behaviors we want. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-06-25 12:44:16 -07:00
David Vernet	d42bae4fcf	rusty: Print build ID when rusty is loaded When someone is testing schedulers, we often have to ask what version the scheduler is running as. Now that we can access the build ID from rust schedulers, let's update scx_rusty to print the build ID when rusty first starts running. This results in output such as the following: ``` [void@maniforge scx]$ rusty 19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug) 19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111 19:04:26 [INFO] DOM[00] mask= 0b00000000111111110000000011111111 19:04:26 [INFO] DOM[01] mask= 0b11111111000000001111111100000000 19:04:26 [INFO] Rusty scheduler started! ``` Signed-off-by: David Vernet <void@manifault.com>	2024-06-25 11:44:46 -05:00
David Vernet	9d9ece11aa	Merge pull request #384 from jfernandez/log-recorder scx_utils: Add log_recorder module for metrics-rs	2024-06-25 11:43:37 -05:00
Changwoo Min	5d0db5c5fe	scx_lavd: revising tunables to reduce micro-stutters This is a second attempt to optimize tunables for a wider range of games. 1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range). Now the latency priority (biased by nice value) will decide which task should run first. The nice value will decide the time slice. 2) The first change gives higher priority to latency-critical tasks than before. To compensate, the slice boost is also increased (2x -> 3x). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-25 16:13:32 +09:00
Jose Fernandez	e5984ed016	scx_utils: Add log_recorder module for metrics-rs This change adds a new module to the scx_utils crate that provides a log recorder for metrics-rs. The log recorder will log all metrics to the console at a configurable interval in an easy to read format. Each metric type will be displayed in a separate section. Indentation will be used to show the hierarchy of the metrics. This results in a more verbose output, but it is easier to read and understand. scx_rusty was updated to use the log recorder and all explicit metric logging was removed. Counters will show the total count and the rate of change per second. Counters with an additional label, like `type` in `dispatched_tasks_total` in rusty, will show the count, rate, and percentage of the total count. Counters: dispatched_tasks_total: 65559 [1344.8/s] prev_idle: 44963 (68.6%) [966.5/s] wsync_prev_idle: 15696 (23.9%) [317.3/s] direct_dispatch: 2833 (4.3%) [35.3/s] dsq: 1804 (2.8%) [21.3/s] wsync: 262 (0.4%) [4.3/s] direct_greedy: 1 (0.0%) [0.0/s] pinned: 0 (0.0%) [0.0/s] greedy_idle: 0 (0.0%) [0.0/s] greedy_xnuma: 0 (0.0%) [0.0/s] direct_greedy_far: 0 (0.0%) [0.0/s] greedy_local: 0 (0.0%) [0.0/s] dl_clamped_total: 1290 [20.3/s] dl_preset_total: 514 [1.0/s] kick_greedy_total: 6 [0.3/s] lb_data_errors_total: 0 [0.0/s] load_balance_total: 0 [0.0/s] repatriate_total: 0 [0.0/s] task_errors_total: 0 [0.0/s] Gauges will show the last set value: Gauges: slice_length_us: 20000.00 Histograms will show the average, min, and max. The histogram will be reset after each log interval to avoid memory leaks, since the data structure that holds the samples is unbounded. Histograms: cpu_busy_pct: avg=1.66 min=1.16 max=2.16 load_avg node=0: avg=0.31 min=0.23 max=0.39 load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39 processing_duration_us: avg=297.50 min=296.00 max=299.00 Signed-off-by: Jose Fernandez <josef@netflix.com>	2024-06-24 18:45:02 -06:00
David Vernet	8059acb634	Merge pull request #381 from vax-r/rusty_dom_load_status_check scx_rusty: Pull domain status check	2024-06-24 17:54:54 -05:00
David Vernet	55ee210d42	Merge pull request #382 from vax-r/rusty_refactor scx_rusty: Refactor ridx assignment in populate_tasks_by_load	2024-06-24 17:47:55 -05:00
Changwoo Min	016229cbcf	scx_lavd: revising tunables for less-preemptive games In some games (e.g., Elden Ring), it was observed that preemption happens much less frequently. The reason is that tasks' runtime per schedule is similar, so it does not meet the existing criteria. To alleviate the problem, the following three tunables are revised: 1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to trigger more preemption. 2) Smaller LAVD_SLICE_MAX_NS works better, especially on 250 or 300Hz kernels. 3) Longer LAVD_ELIGIBLE_TIME_MAX perturbs time lines less frequently. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-24 00:27:33 +09:00
I Hsin Cheng	eab234a74f	scx_rusty: Refactor ridx assignment in populate_tasks_by_load Origin assignment of the variable ridx is equivalent to comparing between "ridx" and "wids - MAX_PIDS". Using u64 max library helper function to perform the comparison and provide better readability. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:58:51 +08:00
I Hsin Cheng	84b9ac4dce	scx_rusty: Pull domain status check Check whether the BalanceState of pull_dom.load inside function try_find_move_task is actually the variant NeedsPull. This performs task migration in a somewhat more conservative manner when the system is under high load. Experiments were performed while the system was compiling the Linux kernel and performing a large amount of I/O with fio at the same time. The results show that before the modification there were 126,617 task migrations system-wide. After the modification there were 115,419 task migrations system-wide. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:38:23 +08:00
David Vernet	5038f54701	Merge pull request #377 from jfernandez/metrics-rs rusty: Integrate stats with the metrics framework	2024-06-21 15:23:20 -05:00
David Vernet	9919b71fd4	Merge pull request #379 from sched-ext/topo_nr_cpu_ids Add topo.nr_cpu_ids() to Topology crate	2024-06-21 13:35:05 -05:00
David Vernet	3bd15be840	rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible() In scx_rlfifo, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. To properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that we don't accidentally dispatch to an invalid DSQ. Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:20 -05:00
David Vernet	263e02f644	rusty: Use nr_cpu_ids instead of nr_cpus_possible In scx_rusty, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. scx_rusty already accounts for offlined CPUs, so to properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids(). Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:19 -05:00
David Vernet	bdbf4b9c05	topo: Return nr_cpu_ids from host Topology In some cases, a host may have an odd topology where there are gaps in CPU IDs (including between possible CPUs). A common pattern in schedulers is to perform allocations for every possible CPU ID, such as creating a per-cpu DSQ. In order to avoid confusing schedulers, let's track the maximum CPU ID on a system so that we can return the number of CPU IDs on the system which is inclusive of gaps. We also update scx_rustland in this change to accommodate the fact that we no longer export nr_cpus_possible() from TopologyMap. Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:13 -05:00
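The distinction drawn in commit bdbf4b9c05 between the number of possible CPUs and the number of CPU IDs can be shown with a short Python model. The helper names mirror the Topology methods mentioned in the log, but the bodies are illustrative assumptions rather than the crate's implementation.

```python
def nr_cpus_possible(possible_cpu_ids: list[int]) -> int:
    """Count of possible CPUs, ignoring any gaps in the ID space."""
    return len(set(possible_cpu_ids))


def nr_cpu_ids(possible_cpu_ids: list[int]) -> int:
    """Size of the CPU ID space: highest CPU ID plus one, gaps included."""
    return max(possible_cpu_ids) + 1


# Hypothetical host where CPU IDs 4 and 5 are missing from the range.
ids = [0, 1, 2, 3, 6, 7]
count = nr_cpus_possible(ids)  # 6
id_space = nr_cpu_ids(ids)     # 8
```

A per-CPU array sized by `count` would have no slot for CPU 6 or 7, which is why code that indexes by CPU ID needs the larger value. The same 6 versus 8 mismatch appears in the error quoted in commit d76551bbd3, where a per-CPU map had to be sized by the other quantity.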
Jose Fernandez	83373b1f4e	rusty: Integrate stats with the metrics framework We need a layer of indirection between the stats collection and their output destinations. Currently, stats are only printed to stdout. Our goal is to integrate with various telemetry systems such as Prometheus, StatsD, and custom metric backends like those used by Meta and Netflix. Importantly, adding a new backend should not require changes to the existing stats code. This patch introduces the `metrics` [1] crate, which provides a framework for defining metrics and publishing them to different backends. The initial implementation includes the `dispatched_tasks_count` metric, tagged with `type`. This metric increments every time a task is dispatched, emitting the raw count instead of a percentage. A monotonic counter is the most suitable metric type for this use case, as percentages can be calculated at query time if needed. Existing logged metrics continue to print percentages and remain unchanged. A new flag, `--enable-prometheus`, has been added. When enabled, it starts a Prometheus endpoint on port 9000 (default is false). This endpoint allows metrics to be charted in Prometheus or Grafana dashboards. Future changes will migrate additional stats to this framework and add support for other backends. [1] https://metrics.rs/ Signed-off-by: Jose Fernandez <josef@netflix.com>	2024-06-21 10:18:44 -06:00
Tejun Heo	7a40059b55	Revert "scx_flatcg: Keep cgroup rb nodes stashed" This reverts commit `3b7f33ea1b`. I haven't root caused it yet but it's easy to reproduce stall and trigger the watchdog after the commit - just running stress in multiple cgroups easily triggers stalls after a couple tens of seconds. Let's revert it for now.	2024-06-19 14:44:26 -10:00
Changwoo Min	9c21ace276	Merge pull request #373 from vax-r/lavd_reuse scx_lavd: Reuse can_task1_kick_task2	2024-06-19 15:29:05 +09:00
I Hsin Cheng	99960ad960	scx_lavd: Reuse can_task1_kick_task2 Use the function can_task1_kick_task2() to replace places which also checking the comp_preemption_info between two cpus for better consistency. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-19 11:01:31 +08:00
Changwoo Min	691869e83f	Merge pull request #369 from sched-ext/lavd-fix-pick-cpu scx_lavd: properly check for idle CPUs in pick_cpu()	2024-06-19 09:23:17 +09:00
Changwoo Min	dad25f1b5d	Merge pull request #368 from multics69/lavd-perf-misc scx_lavd: misc performance tuning and code clean up	2024-06-19 07:26:52 +09:00
Andrea Righi	bad9ed13ef	scx_lavd: properly check for idle CPUs in pick_cpu() It seems that we are not updating `is_idle` when we find an idle CPU with pick_cpu(), causing unnecessary rescheduling events when select_cpu() is called. To resolve this, ensure that the is_idle state is correctly set. Additionally, always ensure that the task is dispatched to the local DSQ immediately upon finding (and reserving) an idle CPU. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-18 17:36:39 +02:00
Changwoo Min	632fa9e4f2	scx_lavd: misc code clean up - clean up u64 and u32 usages in structures to reduce struct size - refactoring pick_cpu() for readability Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-18 18:11:49 +09:00
Changwoo Min	5165bf5a03	scx_lavd: tuning CPU frequency scaling The required CPU performance (cpuperf) was set to 1024 (100%) when the CPU utilization was 100%. When a sudden load spike happens, it makes the system adapt slowly in the next interval. The new scheme always reserves some headroom in advance, so it sets cpuperf to 1024 when the CPU utilization reaches to 85%. This gives some room to adapt in advance. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-18 18:11:49 +09:00
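The headroom change in commit 5165bf5a03 can be compared against the old behaviour with a small Python model. Only the two end points come from the commit message (1024 at 100% utilization before, 1024 at 85% after); the linear scaling below those points, the function names, and the constant name are assumptions.

```python
CPUPERF_FULL = 1024  # 100% performance, as stated in the commit message


def cpuperf_old(util_pct: int) -> int:
    """Old scheme: full performance is requested only at 100% utilization."""
    return min(CPUPERF_FULL, CPUPERF_FULL * util_pct // 100)


def cpuperf_new(util_pct: int, full_at_pct: int = 85) -> int:
    """New scheme: full performance is already requested at 85% utilization.

    Linear scaling below the knee is an assumption made for illustration.
    """
    return min(CPUPERF_FULL, CPUPERF_FULL * util_pct // full_at_pct)


old_at_85 = cpuperf_old(85)  # still below full performance
new_at_85 = cpuperf_new(85)  # already at full performance
```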
I Hsin Cheng	94e3616c02	scx_rusty: Refactor lookup operation for new_domc in task_set_domain Modify the execution sequence before lookup operation for new_domc. If new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely going to fail so we don't have to wait until we found that new_domc is NULL, clearing of cpumask and return operation should be done directly in that case. Plus we should avoid using try_lookup_dom_ctx outside the context of lookup_dom_ctx, as it can keep the interface's consistency. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-18 12:58:17 +08:00
Tejun Heo	819ffd527f	Merge pull request #367 from sched-ext/htejun/dsq-iter-fix scx/compat.bpf.h: Fix __COMPAT_scx_bpf_consume_task() and improve scx_qmap example	2024-06-17 10:29:38 -10:00
Tejun Heo	1012e3a6db	scx/compat.bpf.h: Fix __COMPAT_scx_bpf_consume_task() and improve scx_qmap example __COMPAT_scx_bpf_consume_task() wasn't calling scx_bpf_consume_task() at all and was always returning false. Fix it. Also, update scx_qmap usage example so that it matches cgroup ID rather than comm prefix. This should make testing with multiple processes a bit easier.	2024-06-17 10:11:06 -10:00
David Vernet	0184444285	Merge pull request #366 from sched-ext/task_set_domain_global rusty: Make dom_xfer_task() a global prog	2024-06-17 14:43:45 -05:00
David Vernet	dfe0ffb312	Merge pull request #347 from sched-ext/rusty_cleanup rusty: Clean up some logic in rusty	2024-06-17 14:26:53 -05:00
David Vernet	7985ee556e	rusty: Clean up dispatch logic The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up so that we're just comparing dom ids rather than iterating over arrays nested inside of pcpu context. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:24:30 -05:00
David Vernet	87aa86845d	rusty: Refactor + slightly improve wake_sync Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to see if the waker CPU's runqueue is empty, and then migrate the wakee there if so. We'll want to expand this to be more thorough, such as: - Checking to see if prev_cpu and waker_cpu share the same LLC when determining where to migrate - Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores - ... Right now all of that code is just a big blob in the middle of rusty_select_cpu(). Let's pull it into its own function to improve readability, and also add some logic to stay on prev_cpu if it shares an LLC with the waker. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:24:29 -05:00
David Vernet	fed66fa571	rusty: Make dom_xfer_task() a global prog It seems that task_set_domain() is nearly at the point where it can cause the verifier to get confused and think that it's exceeding the number of available instructions per program. I've seen this a number of times when making small changes to task_set_domain(), and it once again happened when @vax-r (I-Hsin Cheng) made a small cleanup change to rusty in https://github.com/sched-ext/scx/pull/362. To avoid this, let's just make dom_xfer_task() a separate global program so that the verifier doesn't have to worry about branch pruning, etc. depending on what the caller does. This should hopefully make task_set_domain() (and its callers) much less brittle. Signed-off-by: David Vernet <void@manifault.com>	2024-06-17 14:22:26 -05:00
Tejun Heo	b6ebdc635a	compat: Compact min requirement checks Let's check only the latest one.	2024-06-16 06:53:58 -10:00
Tejun Heo	aeb805a93e	compat: Drop support for missing sched_ext_ops.dump() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop support for missing sched_ext_ops.dump(). The open helper macros now check the existence of the fields and abort if missing.	2024-06-16 06:43:43 -10:00
Tejun Heo	4cca1e9acf	compat: Drop support for missing sched_ext_ops.tick() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop support for missing sched_ext_ops.tick(). The open helper macros now check the existence of the field and abort if missing.	2024-06-16 06:40:28 -10:00
Tejun Heo	970c04b43a	compat: Drop support for missing sched_ext_ops.exit_dump_len In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop support for missing sched_ext_ops.exit_dump_len. The open helper macros now check the existence of the field and abort if missing.	2024-06-16 06:37:34 -10:00
Tejun Heo	046bdfd5e0	compat: Drop support for missing sched_ext_ops.hotplug_seq In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop support for missing sched_ext_ops.hotplug_seq. The open helper macros now check the existence of the field and abort if missing.	2024-06-16 06:34:59 -10:00
Tejun Heo	dde2942125	compat: Drop __COMPAT_scx_bpf_cpuperf_() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_(). The open helper macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.	2024-06-16 06:16:53 -10:00
Tejun Heo	13e8388e1e	compat: Drop __COMPAT_HAS_CPUMASKS In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros now check the existence of scx_bpf_nr_cpu_ids() and abort if not.	2024-06-16 06:12:06 -10:00
Tejun Heo	66901e2b44	compat: Drop __COMPAT_scx_bpf_dump() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_dump(). The open helper macros now check the existence of scx_bpf_dump_bstr() and abort if not. While at it, reorder the min requirement checks so that newly added ones are up top to make testing easier.	2024-06-16 06:02:47 -10:00
Tejun Heo	0d8adf2260	compat: Drop __COMPAT_scx_bpf_exit() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_exit(). The open helper macros now check the existence of scx_bpf_exit_bstr() and abort if not.	2024-06-15 20:36:17 -10:00
Tejun Heo	5b5e5be906	compat: Drop __COMPAT_SCX_KICK_IDLE In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros now check the existence of SCX_KICK_IDLE and abort if not.	2024-06-15 20:24:15 -10:00
Tejun Heo	b730f35e68	scx/common.h: Improve SCX_BUG() macro There's no guarantee that errno is set or contains relevant information when SCX_BUG() is invoked. This sometimes leads to "task failed successfully" messages: # ./scx_simple ../scheds/c/scx_simple.c:72 [scx panic]: Success SCX_OPS_SWITCH_PARTIAL missing, kernel too old? While not critical, it's not great. Let's update it so that errno is printed in parentheses when non-zero and match the tag to the macro name so that what's printed is the following: # ./scx_simple [SCX_BUG] ../scheds/c/scx_simple.c:72 SCX_OPS_SWITCH_PARTIAL missing, kernel too old?	2024-06-15 20:17:32 -10:00
Tejun Heo	7c9aedaefe	compat: Drop __COMPAT_scx_bpf_switch_all() In preparation of upstreaming, let's set the min version requirement at the released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_all(). The open helper macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.	2024-06-15 20:03:37 -10:00
Tejun Heo	dd6255a601	Merge pull request #359 from sched-ext/htejun/cosmetic common.bpf.h: Cosmetic changes	2024-06-15 06:42:00 -10:00
Andrea Righi	cb20a6f136	scx_rlfifo: dispatch all tasks on the first CPU available With commit `786ec0c0` ("scx_rlfifo: schedule all tasks in user-space") all the scheduling decisions are now happening in user-space. This also bypasses the built-in idle selection logic, delegating the CPU selection for each task to the user-space scheduler. The easiest way to distribute tasks across the available CPUs is to simply dispatch them on the first CPU available. In this way the scheduler becomes usable in practical scenarios while maintaining its simplicity. This spreads all tasks across all the available CPUs. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:13:53 +02:00
Andrea Righi	786ec0c04a	scx_rlfifo: schedule all tasks in user-space Disable all the BPF optimization shortcuts by default and force all tasks to be processed by the user-space scheduler. Given that the primary goal of this scheduler is to offer a straightforward and intuitive example for experimental purposes, this change simplifies the process for individuals looking to experiment, allowing them to apply changes to user-space code and quickly observe the effects, without dealing with any in-kernel optimizations. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:07:39 +02:00
Andrea Righi	59f47d6659	scx_rlfifo: improve code readability No functional change, just add some comments to better describe the parameters used when initializing the main BpfScheduler object. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-15 16:05:28 +02:00
Tejun Heo	d3b34d1df7	scx_qmap: Rename central_timer to monitor_timer The name was copied from scx_central.bpf.c and doesn't match what the timer is used for in scx_qmap.bpf.c.	2024-06-14 16:07:20 -10:00
Tejun Heo	13abb6fd26	scx/common.bpf.h: Reorganize Currently, the BPF declarations and generic helpers are in the same section. Let's move the generic helpers down to its own section.	2024-06-14 15:36:00 -10:00
Tejun Heo	d7677e3e5c	scx/common.bpf.h: Rename bpf_log2[l]() to u32/64_log2() The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and bpf_log2l() to u64_log2(). While at it, relocate them below compiler directive helpers.	2024-06-14 15:22:39 -10:00
Tejun Heo	5a2412c211	scx/common.bpf.h: Minor comment updates	2024-06-14 15:22:29 -10:00
Andrea Righi	8c6fe540eb	scx_rustland: prevent excessive starvation when system is congested Keep track of the maximum vruntime among all tasks and flush them if the difference between the maximum and minimum vruntime exceeds slice_ns. This helps to prevent excessive starvation, as every task is guaranteed to be dispatched within the slice_ns time limit. Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-14 20:09:19 +02:00
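The flush condition described in the commit above can be sketched as follows. This is only an illustration, not the scx_rustland code: the function name `should_flush` and the idea of passing the waiting tasks' vruntimes as a slice are assumptions made for the example.

```rust
/// Hypothetical helper: returns true when the spread between the largest and
/// the smallest vruntime of the waiting tasks exceeds `slice_ns`, i.e. when
/// the queued tasks should be flushed to avoid starving the oldest one.
fn should_flush(vruntimes: &[u64], slice_ns: u64) -> bool {
    let min = vruntimes.iter().copied().min();
    let max = vruntimes.iter().copied().max();
    match (min, max) {
        // max >= min always holds here, so the subtraction cannot underflow.
        (Some(min), Some(max)) => max - min > slice_ns,
        // No waiting tasks: nothing to flush.
        _ => false,
    }
}
```

With `slice_ns = 20`, vruntimes of `[10, 40]` would trigger a flush while `[10, 30]` would not, because the commit message speaks of the difference exceeding `slice_ns`.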
Changwoo Min	94a39f419f	scx_lavd: add the design of core compaction The core compaction seems to work great on various hardware. Now it is time to document its design. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-14 11:53:52 +09:00
Changwoo Min	5068d75bf3	Merge pull request #351 from multics69/lavd-power-v2 scx_lavd: improve CPU frequency scaling	2024-06-14 09:29:10 +09:00
Tejun Heo	a3342810c7	Merge pull request #352 from dschatzberg/mitosis common: Add css iter forward declares	2024-06-13 06:50:06 -10:00
Dan Schatzberg	114e4b644b	common: Add css iter forward declares These are used in mitosis, but they belong in common code so other schedulers can do css iteration. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-06-12 15:02:48 -07:00
Changwoo Min	747bf2a7d7	scx_lavd: add the design of CPU frequency scaling Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-13 01:42:19 +09:00
Changwoo Min	2e74b86b4a	scx_lavd: logging cpu performance target Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-13 00:44:04 +09:00
Changwoo Min	e6348a11e9	scx_lavd: improve frequency scaling logic The old logic for CPU frequency scaling is that the task's CPU performance target (i.e., target CPU frequency) is checked every tick interval and updated immediately. Indeed, it samples and updates a performance target every tick interval. Ultimately, it fluctuates CPU frequency every tick interval, resulting in less steady performance. Now, we take a different strategy. The key idea is to increase the frequency as soon as possible when a task starts running for quick adaptation to load spikes. However, if necessary, it decreases gradually every tick interval to avoid frequency fluctuations. In my testing, it shows more stable performance in many workloads (games, compilation). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-12 23:40:40 +09:00
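A minimal sketch of the "raise immediately, lower gradually" strategy from the commit above. The type `CpuPerf`, its methods, and the step constant `PERF_STEP_DOWN` are hypothetical names chosen for illustration; the actual scx_lavd logic lives in BPF C and may differ in detail.

```rust
/// Assumed decay step applied on each tick (illustrative value only).
const PERF_STEP_DOWN: u32 = 64;

/// Hypothetical per-CPU performance state.
struct CpuPerf {
    target: u32,
}

impl CpuPerf {
    /// When a task starts running, jump to its target at once if it is higher,
    /// so that load spikes are served without delay.
    fn on_task_start(&mut self, task_target: u32) {
        if task_target > self.target {
            self.target = task_target;
        }
    }

    /// On every tick, move towards a lower target in small steps instead of
    /// dropping immediately, which avoids frequency fluctuations.
    fn on_tick(&mut self, task_target: u32) {
        if task_target < self.target {
            let lowered = self.target.saturating_sub(PERF_STEP_DOWN);
            self.target = lowered.max(task_target);
        }
    }
}
```

In this sketch a CPU at target 200 jumps straight to 800 when a demanding task starts, but needs several ticks to fall back once the demand drops.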
Changwoo Min	753f333c09	scx_lavd: refactoring do_update_sys_stat() Originally, do_update_sys_stat() simply calculated the system-wide CPU utilization. Over time, it has evolved to collect all kinds of system-wide, periodic statistics for decision-making, so it has become bulky. Now, it is time to refactor it for readability. This commit does not contain functional changes other than refactoring. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-12 21:15:25 +09:00
Changwoo Min	9d129f0afa	scx_lavd: rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS The periodic CPU utilization routine does a lot of other work now. So we rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-12 20:06:17 +09:00
Changwoo Min	7046b47b9c	scx_lavd: properly calculate task's runtime after suspend/resume When a device is suspended and resumed, the suspended duration is added up to a task's runtime if the task was running on the CPU. After the resume, the task's runtime is incorrectly long and the scheduler starts to recognize the system is under heavy load. To avoid such a problem, the suspended duration is measured and subtracted from the task's runtime. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-12 15:58:41 +09:00
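The accounting fix described above amounts to removing the suspended interval from the measured wall-clock runtime. A small sketch, where the function name and the nanosecond parameters are assumptions rather than identifiers from scx_lavd:

```rust
/// Hypothetical helper: runtime of a task that started running at
/// `started_ns` and is observed at `now_ns`, excluding `suspended_ns`
/// spent in system suspend. Saturating arithmetic keeps the result at
/// zero if the inputs are inconsistent.
fn effective_runtime_ns(started_ns: u64, now_ns: u64, suspended_ns: u64) -> u64 {
    now_ns
        .saturating_sub(started_ns)
        .saturating_sub(suspended_ns)
}
```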
Dan Schatzberg	b95cfb0772	mitosis: Fix build The target wasn't dependent on the previous sched so building all schedulers ended up not building scx_mitosis which broke the install script.	2024-06-11 14:33:32 -07:00
Dan Schatzberg	9528d4603e	Merge pull request #339 from dschatzberg/mitosis scheds: Add scx_mitosis scheduler	2024-06-11 16:50:25 -04:00
Dan Schatzberg	3b6e2dee20	scheds: Add scx_mitosis scheduler scx_mitosis is a dynamic affinity scheduler which assigns cgroups to Cells and Cells to discrete sets of CPUs. The number of cells is dynamic as is the CPU assignment. BPF mostly just does vtime scheduling for each cell, tracks load, and responds to reconfiguration from userspace. Userspace makes decisions about how to assign cgroups to cells and cells to cpus. This is not yet a complete scheduler, much of the userspace logic is a placeholder as I experiment with better logic. I also want to add richer scheduling semantics to userspace, e.g. so that cells can do more "soft-affinity" rather than the strict partitioning implemented currently. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-06-11 10:34:53 -07:00
David Vernet	1dbf874709	Merge pull request #341 from vax-r/rusty_data_races scx_rusty: Eliminate data races possibility for domain min_vruntime	2024-06-11 12:04:40 -05:00
David Vernet	b50ba626cc	uei: Pass skel to RESIZE_ARRAY() The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable. This is bad practice and can cause issues in other macros that use it. Let's update it to explicitly take a skel argument. Signed-off-by: David Vernet <void@manifault.com>	2024-06-11 10:15:26 -05:00
I Hsin Cheng	4e30bb9ccf	scx_rusty: Eliminate data races possibility for domain min_vruntime READ_ONCE()/WRITE_ONCE() macros were added in commit 0921fde, so we should be able to use them to avoid possible data races on domc->min_vruntime. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-11 10:57:03 +08:00
Tejun Heo	30f27d99d9	Merge pull request #340 from sched-ext/htejun/layered-updates scx_layered: Improve yield, preemption and other behaviors	2024-06-10 11:27:44 -10:00
Tejun Heo	9ec3594b4f	scx_layered: Several fixes to address David's review - pick_idle_cpu() was putting idle_smtmask that it didn't acquire. - layered_enqueue() was unnecessarily entering preemption path after finding an idle CPU. - No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They never do. - Relocate cctx->yielding test into keep_running() from its caller.	2024-06-10 11:23:37 -10:00
Tejun Heo	92317aa2f9	Use __always_inline uniformly Instead of using __attribute__((always_inline)) use the __always_inline macro provided by BPF.	2024-06-10 11:23:26 -10:00
Changwoo Min	472ab945b8	scx_lavd: core compaction for low power consumption (#338) When system-wide CPU utilization is low, it is very likely all the CPUs are running with very low utilization. That means all CPUs run with low clock frequency thanks to dynamic frequency scaling and very frequently enter and leave C-states. That results in low performance (i.e., low clock frequency) and high power consumption (i.e., frequent P-/C-state transitions). The idea of core compaction is to use fewer CPUs when system-wide CPU utilization is low. The chosen cores (called "active cores") will run at higher utilization and higher clock frequency, and the rest of the cores (called "idle cores") will be in a C-state for a much longer duration. Thus, the core compaction can achieve higher performance with lower power consumption. One potential problem of core compaction is latency spikes when all the active cores are overloaded. A few techniques are incorporated to solve this problem. 1) Limit the active CPU core's utilization below a certain limit (say 50%). 2) Do not use the core compaction when the system-wide utilization is moderate (say 50%). 3) Do not enforce the core compaction for kernel and pinned user-space tasks since they are manually optimized for performance. In my experiments, under a wide range of system-wide CPU utilization (5%-80%), the core compaction reduces power consumption by 7-30% without sacrificing average and 99p tail latency. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-08 09:25:27 +09:00
Tejun Heo	a165970ab9	scx_layered: Add migration statistic Keep track of how frequent migrations are.	2024-06-07 11:49:39 -10:00
Tejun Heo	5b31d96c3d	scx_layered: Implement "preempt_first" layer property If set, tasks in the layer will try to preempt tasks in their previous CPUs before trying to find idle CPUs.	2024-06-07 11:49:39 -10:00
Tejun Heo	ece3638664	scx_layered: Allow confined layers to preempt There's no reason to restrict confined layers from preempting on the CPUs that they are entitled to. Allow preemption for confined layers.	2024-06-07 11:49:39 -10:00
Tejun Heo	7c48814ed0	scx_layered: Prefer preempting the CPU the task was previously on Currently, when preempting, searching for the candidate CPU always starts from the RR preemption cursor. Let's first try the previous CPU the preempting task was on as that may have some locality benefits.	2024-06-07 11:49:38 -10:00
Tejun Heo	3db3257911	scx_layered: Find and kick an idle CPU from enqueue path When a task is being enqueued outside wakeup path, ops.select_cpu() isn't called, so we can end up in a situation where a newly enqueued task keeps waiting in one of the DSQs while there are idle CPUs. Factor out idle CPU selection path into pick_idle_cpu() and call it from the enqueue path in such cases. This problem is shared across schedulers and likely needs a more generic solution in the future.	2024-06-07 11:49:38 -10:00
Tejun Heo	0f2d1ad2fa	scx_layered: Implement a new layer parameter "yield_ignore" yield(2) currently gives up the entire slice. Add "yield_ignore" layer parameter which can modulate the magnitude of yielding. When 1.0, yields are completely ignored. When 0.5, only half of the full slice is given up, and so on.	2024-06-07 11:49:38 -10:00
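The effect of "yield_ignore" can be illustrated with a small calculation. This is a sketch under the assumption that the parameter scales linearly between "give up the full slice" (0.0) and "ignore the yield" (1.0), as the two data points in the commit message suggest; the function name is hypothetical and the real implementation is BPF C inside scx_layered.

```rust
/// Hypothetical helper: amount of the slice (in ns) that a yielding task
/// gives up, given the layer's `yield_ignore` setting.
fn yield_charge_ns(slice_ns: u64, yield_ignore: f64) -> u64 {
    // Keep the parameter inside the documented 0.0..=1.0 range.
    let ignore = yield_ignore.clamp(0.0, 1.0);
    (slice_ns as f64 * (1.0 - ignore)).round() as u64
}
```

For a 20 ms slice, `yield_ignore = 0.5` gives up 10 ms and `yield_ignore = 1.0` gives up nothing.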
Tejun Heo	4aa8124b9c	scx_layered: Add explicit yield() support Currently, a task which yields is treated the same as a task which has run out its slice. As the budget charged to a task is calculated from wall clock time, a repeatedly yielding task can stay at the top of the queue for quite a while hogging the CPU and spiking the number of scheduling events. Let's add explicit yield support. A yielding task is now always charged the full slice and not allowed to keep running on the same CPU.	2024-06-07 11:49:38 -10:00
Tejun Heo	436cd7ba9e	scx_layered: Make enqueue path comprehensive and handle CPU preemptions The keep_running path relies on the implicit last task enqueue which makes the statistics a bit difficult to track. Let's make the enqueue path comprehensive: - Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly. - Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted by a higher pri sched class and handle the re-enqueued tasks explicitly in layered_enqueue(). - Add more statistics to track all enqueue operations.	2024-06-07 11:49:38 -10:00
Tejun Heo	4a0993ceab	scx_layered: Allow long-running tasks to keep running on the same CPU When a task exhausts its slice, layered currently doesn't make any effort to keep it on the same CPU. It dispatches the next task to run and then enqueues the running one. This leads to suboptimal behaviors. e.g. When this happens to a task in a preempting layer, the task will most likely find an idle CPU or a task to preempt and then migrate there causing a completely unnecessary migration. This patch makes layered_dispatch() test whether the current task should keep running on the CPU and, if so, skip dispatching to keep the task running. This behavior depends on the implicit local DSQ enqueue mechanism which triggers when there are no other tasks to run.	2024-06-07 11:49:38 -10:00
Tejun Heo	200af60f2a	scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename - scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care about the type of the symbol. - scx_layered: Fix load failure on kernels >= v6.10-rc due to scheduler_tick() -> sched_tick rename. Attach the tick fentry function to either scheduler_tick() or sched_tick().	2024-06-06 12:54:59 -10:00
Andrea Righi	8a3ee7b801	scx_rustland: never use a time slice that exceeds the default value Make sure to never assign a time slice longer than the default time slice, that can be used as an upper limit. This seems to prevent potential stall conditions (reported by the CachyOS community) when running CPU-intensive workloads, such as: [ 68.062813] sched_ext: BPF scheduler "rustland" errored, disabling [ 68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s) [ 68.062832] scx_watchdog_workfn+0x154/0x1e0 [ 68.062837] process_one_work+0x18e/0x350 [ 68.062839] worker_thread+0x2fa/0x490 [ 68.062841] kthread+0xd2/0x100 [ 68.062842] ret_from_fork+0x34/0x50 [ 68.062844] ret_from_fork_asm+0x1a/0x30 Fixes: `6f4cd853` ("scx_rustland: introduce virtual time slice") Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com> Tested-by: Piotr Gorski <piotrgorski@cachyos.org> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-06 17:56:23 +02:00
Andrea Righi	6f4cd853f9	scx_rustland: introduce virtual time slice

Overview
========

Currently, a task's time slice is determined based on the total number of tasks waiting to be scheduled: the more overloaded the system, the shorter the time slice. This approach can help to reduce the average wait time of all tasks, allowing them to progress more slowly, but uniformly, thus providing a smoother overall system performance.

However, under heavy system load, this approach can lead to very short time slices distributed among all tasks, causing excessive context switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are queued and dispatched in bursts). This can also result in fluctuations of longer and shorter time slices, depending on the number of tasks still waiting in the scheduler's queue. Such behavior can also negatively impact soft real-time workloads, such as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice: the idea is to evaluate the optimal time slice of a task, considering the vruntime as a deadline for the task to complete its work before releasing the CPU.

This is accomplished by calculating the difference between the task's vruntime and the global current vruntime and using this value as the task time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on their previous work pattern) get a much higher priority (due to vruntime-based scheduling and the additional priority boost for being classified as interactive), but they are also given a shorter time slice to complete their work and fulfill their promise of rapidity.

At the same time tasks that are more CPU-intensive get de-prioritized, but they will tend to have a longer time slice available, reducing in this way the amount of context switches that can negatively affect their performance.

In conclusion, latency-sensitive tasks get a high priority and a short time slice (and they can preempt other tasks), CPU-intensive tasks get low priority and a long time slice.

Example
=======

Let's consider the following theoretical scenario:

  task | time
  -----+-----
    A  |  1
    B  |  3
    C  |  6
    D  |  6

In this case task A represents a short interactive task, task C and D are CPU-intensive tasks and task B is mainly interactive, but it also requires some CPU time.

With a uniform time slice, scaled based on the amount of tasks, the scheduling looks like this (assuming the time slice is 2):

  A B B C C D D A B C C D D C C D D
   |   |   |   | | |   |   |   |
   `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

  A B B C C C D A B C C C D D D D D
   |   |     | | | |     |
   `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled by the total number of tasks waiting to be scheduled. Instead, their time slice is adjusted based on their previous CPU usage. Tasks that used more CPU time are given longer slices and their processing time tends to be packed together, reducing the amount of context switches.

Meanwhile, latency-sensitive tasks can still be processed as soon as they need to, because they get a higher priority and they can preempt other tasks. However, they will get a short time slice, so tasks that were incorrectly classified as interactive will still be forced to release the CPU quickly.

Experimental results
====================

This patch has been tested on an AMD Ryzen 7 5800X 8-Core Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while simultaneously overloading the system with a parallel kernel build (`make -j32`).

The average frames per second (fps) reported by Steam is used as a metric for measuring system responsiveness (the higher the better):

  Game                       | before  | after   | delta  |
  ---------------------------+---------+---------+--------+
  Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
  Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
  Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
  Terraria                   |  98 fps | 108 fps | +10.2% |
  Team Fortress 2            |  81 fps |  92 fps | +13.6% |
  WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
  ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (that should be taken with a grain of salt, considering the overall poor performance in both cases), the virtual time slice seems to systematically provide a boost in responsiveness of around +10-20% fps.

It also seems to significantly prevent potential audio cracking issues when the system is massively overloaded: no audio cracking was detected during the entire run of these tests with the virtual deadline change applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-04 23:01:13 +02:00
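The slice formula from the commit above, `task_slice = task_vruntime - min_vruntime`, can be written as a tiny helper. This is a sketch only: the function name is hypothetical, and the use of a saturating subtraction (so that a task whose vruntime lags behind `min_vruntime` gets a zero-length result instead of wrapping around) is an assumption of this example, not something the commit message states.

```rust
/// Hypothetical helper: virtual time slice of a task, computed as the
/// distance between the task's vruntime and the global minimum vruntime.
/// Tasks close to `min_vruntime` (typically interactive ones) get a short
/// slice; tasks far ahead (CPU-intensive ones) get a longer slice.
fn virtual_time_slice(task_vruntime: u64, min_vruntime: u64) -> u64 {
    // Assumption: clamp at zero rather than underflowing.
    task_vruntime.saturating_sub(min_vruntime)
}
```

For example, with `min_vruntime = 1_000`, a task at vruntime 1_500 would get a slice of 500, while a task at 7_000 would get 6_000. A real scheduler would likely also bound the result from above and below.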
Tejun Heo	e556dd375d	scx: Unify loading and running boilerplate across rust schedulers Make restart handling with user_exit_info simpler and use the load and report macros consistently across the rust schedulers. This makes all schedulers automatically handle auto restarts from CPU hotplug events. Note that this is necessary even for scx_lavd, which has CPU hotplug operations, as CPU hotplug operations which took place between skel open and scheduler init can still trigger restart.	2024-06-03 12:25:41 -10:00
David Vernet	a26d3f2220	Merge pull request #328 from sched-ext/rusty_cpumask_overlap rusty: Use cpumask kfuncs in cpumask_intersects_domain()	2024-06-03 20:42:11 +00:00
David Vernet	0ae676a9ca	rusty: Use cpumask kfuncs in cpumask_intersects_domain() In cpumask_intersects_domain(), we check whether a given cpumask has any CPUs in common with the specified domain by looking at the const, static dom_cpumasks map. This map is only really necessary when creating the domain struct bpf_cpumask objects at scheduler load time. After that, we can just use the actual struct bpf_cpumask object embedded in the domain context. Let's use that and cpumask kfuncs instead. This allows rusty to load with https://github.com/sched-ext/sched_ext/pull/216. Signed-off-by: David Vernet <void@manifault.com>	2024-06-03 15:01:19 -05:00
Tejun Heo	a2d5310cb6	Bump versions for a release	2024-06-03 08:35:21 -10:00
Andrea Righi	ccef4d0ba1	scx_rustland: get rid of --builtin-idle option Commit `23b0bb5f` ("scx_rustland: dispatch interactive tasks on any CPU") allows only interactive tasks to be dispatched on any CPU, enabling them to quickly use the first idle CPU available. Non-interactive tasks, on the other hand, are kept on the same CPU as much as possible. This change deprioritizes CPU-intensive tasks further, but it also helps to exploit cache locality, while latency-sensitive tasks are dispatched sooner, improving overall responsiveness, despite the potential migration cost. Given this new logic, the --builtin-idle option, which forces all tasks to be dispatched on the CPU assigned during select_cpu(), no longer offers significant benefits. It would merely reduce the responsiveness of interactive tasks. Therefore, simply remove this option, allowing the scheduler to determine the target CPU(s) for all tasks based on their nature. Fixes: `23b0bb5f` ("scx_rustland: dispatch interactive tasks on any CPU") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-03 10:02:04 +02:00
I Hsin Cheng	0921fde1f1	scx_lavd: Adding READ_ONCE()/WRITE_ONCE() macros In order to prevent the compiler from merging or refetching load/store operations or reordering them in unwanted ways, we take the implementation of READ_ONCE()/WRITE_ONCE() from the kernel sources under "/include/asm-generic/rwonce.h". Use WRITE_ONCE() in flip_sys_cpu_util() so that the compiler does not apply unwanted optimizations or make incorrect assumptions when performing the bit-flipping store. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-01 11:07:52 +08:00
Tejun Heo	ebae7d5e6a	Merge pull request #312 from sched-ext/htejun/layered-updates scx_layered: Improve affn_viol handling and implement dump method	2024-05-28 10:22:31 -10:00
Tejun Heo	d3ed4cb5c7	scx_layered: Successfully consuming from HI_FALLBACK_DSQ should terminate dispatching layered_dispatch() was incorrectly continuing down to the lower priority DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to latency issues. Fix it.	2024-05-28 10:20:55 -10:00
Changwoo Min	4c0f996ddc	Revert "scx_lavd: Enforce memory barrier in flip_sys_cpu_util"	2024-05-27 12:19:21 +09:00
Changwoo Min	0371ccae40	Merge pull request #318 from vax-r/Memory_barrier scx_lavd: Enforce memory barrier in flip_sys_cpu_util	2024-05-26 21:00:25 +09:00
I Hsin Cheng	f839106a57	scx_lavd: Enforce memory barrier in flip_sys_cpu_util Use the GNU built-in __sync_fetch_and_xor() to perform the XOR operation on the global variable "__sys_cpu_util_idx" to ensure the operation's visibility. The built-in function "__sync_fetch_and_xor()" provides both an atomic operation and a full memory barrier, which is needed by every operation (especially store operations) on global variables. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-05-26 15:27:10 +08:00
I Hsin Cheng	5881c61a5e	scx_central: Provide backward compatibility Newer sched_ext kernel versions set the scheduler to schedule all tasks within the system by default. However, some users are using older kernel versions. Therefore we call "__COMPAT_scx_bpf_switch_all()" to move all tasks to the "SCHED_EXT" class so scx_central would schedule all tasks by default in older kernels.	2024-05-24 15:12:34 +08:00
Tejun Heo	99eb56b6b5	scx_layered: Implement layered_dump() which dumps layer states.	2024-05-23 12:54:17 -10:00
Tejun Heo	a576242b69	scx_layered: Open and grouped layers can handle tasks with custom affinities The main reason why custom affinities are tricky for scx_layered is because if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not get consumed for an indefinite amount of time. However, this is only true for confined layers. Both open and grouped layers are always consumed from all CPUs and thus don't have this risk. Let's allow tasks with custom affinities in open and grouped layers. - In select_cpu(), don't consider direct dispatching to a local DSQ as affinity violation even if the target CPU is outside the layer's cpumask if the layer is open. - In enqueue(), separate out per-cpu kthread special case into its own block. Note that this is only applied if the layer is not preempting as a preempting layer has a higher priority than HI_FALLBACK_DSQ anyway. - Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is confined. - The preemption path now also runs for tasks with a custom affinity in open and grouped layers. Update it so that it only considers the CPUs in the preempting task's allowed cpumask. (cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)	2024-05-23 12:54:17 -10:00
Tejun Heo	1ce23760b5	scx_layered: Improve affinity violation handling - AFFN_VIOL for per-cpu tasks could be double counted. Once in select_cpu() and again in enqueue(). Count in select_cpu() only when direct dispatching. - Violating tasks were prioritized over non-violating ones because they were queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for violating user threads. HI is dispatched after preempting layers and LO after all other layers. This shouldn't change the behavior too much for kthreads while punishing, rather than rewarding, violating user threads. (cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)	2024-05-23 12:54:17 -10:00
Andrea Righi	23b0bb5ff5	scx_rustland: dispatch interactive tasks on any CPU Dispatch non-interactive tasks on the CPU selected by the built-in idle selection logic and allow interactive tasks to be dispatched on any CPU. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-22 12:12:55 +02:00
Andrea Righi	3be3b91c29	scx_rustland: assign effective time slice to all tasks Do not always assign the maximum time slice to interactive tasks, but use the same value of the dynamic time slice for everyone. This seems to prevent potential audio cracking when the system is over commissioned. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-22 12:12:55 +02:00
Andrea Righi	cca84479f8	scx_rustland: ignore built-in selection logic with --full-user The option --full-user is provided to delegate all scheduling decisions to the user-space scheduler with no exception, including the idle selection logic. Therefore, make this option incompatible with --builtin-idle and completely bypass the built-in idle selection logic when running in full-user mode. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-22 09:02:02 +02:00
Andrea Righi	9e4bea4a1c	scx_rustland_core: switch to FIFO when system is underutilized Provide a knob in scx_rustland_core to automatically turn the scheduler into a simple FIFO when the system is underutilized. This choice is based on the assumption that, in the case of system underutilization (fewer tasks running than the amount of available CPUs), the best scheduling policy is FIFO. With this option enabled the scheduler starts in FIFO mode. If most of the CPUs are busy (nr_running >= num_cpus - 1), the scheduler immediately exits from FIFO mode and starts to apply the logic implemented by the user-space component. Then the scheduler can switch back to FIFO if there are no tasks waiting to be scheduled (evaluated using a moving average). This option can be enabled/disabled by the user-space scheduler using the fifo_sched parameter in BpfScheduler: if set, the BPF component will periodically check for system utilization and switch back and forth to FIFO mode based on that. This improves the performance of workloads that use a small amount of the available CPUs in the system, while still maintaining the same good level of performance for interactive tasks when the system is over commissioned. In certain video games, such as Baldur's Gate 3 or Counter-Strike 2, running in "normal" system conditions, we can experience a boost in fps of approximately 4-8% with this change applied. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-22 09:02:02 +02:00
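The mode switching described in the commit above can be sketched as a small state machine. Everything named here (`FifoSwitch`, its fields, the moving-average weights and the 0.5 threshold) is an assumption made for illustration; only the exit condition `nr_running >= num_cpus - 1`, the initial FIFO mode, and the use of a moving average of waiting tasks come from the commit message.

```rust
/// Hypothetical tracker for the FIFO / user-space scheduling mode.
struct FifoSwitch {
    fifo_mode: bool,
    /// Moving average of the number of tasks waiting to be scheduled.
    avg_waiting: f64,
}

impl FifoSwitch {
    /// The scheduler starts in FIFO mode.
    fn new() -> Self {
        Self { fifo_mode: true, avg_waiting: 0.0 }
    }

    /// Periodic check of system utilization.
    fn update(&mut self, nr_running: u64, num_cpus: u64, nr_waiting: u64) {
        // Assumed exponential moving average with a 3/4 : 1/4 weighting.
        self.avg_waiting = 0.75 * self.avg_waiting + 0.25 * nr_waiting as f64;

        if nr_running >= num_cpus.saturating_sub(1) {
            // Most CPUs are busy: leave FIFO mode immediately.
            self.fifo_mode = false;
        } else if self.avg_waiting < 0.5 {
            // Assumed threshold for "no tasks waiting" on average.
            self.fifo_mode = true;
        }
    }
}
```

Because the return to FIFO mode depends on the average rather than on a single sample, one quiet interval right after a busy period is not enough to flip the mode back in this sketch.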
Andrea Righi	0d75c80587	Revert "Merge pull request #305 from sched-ext/rustland-fifo-mode" This merge included additional commits that were supposed to be included in a separate pull request and have nothing to do with the fifo-mode changes. Therefore, revert the whole pull request and create a separate one with the correct list of commits required to implement this feature. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-22 09:00:25 +02:00
Andrea Righi	f38d91bf29	scx_rustland: dispatch interactive tasks on any CPU Dispatch non-interactive tasks on the CPU selected by the built-in idle selection logic and allow interactive tasks to be dispatched on any CPU. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-21 18:08:43 +02:00
Andrea Righi	6901ddb150	scx_rustland: assign effective time slice to all tasks Do not always assign the maximum time slice to interactive tasks, but use the same value of the dynamic time slice for everyone. This seems to prevent potential audio cracking when the system is over commissioned. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-21 18:08:12 +02:00
Andrea Righi	d25675ff44	scx_rustland_core: switch to FIFO when system is underutilized Provide a knob in scx_rustland_core to automatically turn the scheduler into a simple FIFO when the system is underutilized. This choice is based on the assumption that, in the case of system underutilization (less tasks running than the amount of available CPUs), the best scheduling policy is FIFO. With this option enabled the scheduler starts in FIFO mode. If most of the CPUs are busy (nr_running >= num_cpus - 1), the scheduler immediately exits from FIFO mode and starts to apply the logic implemented by the user-space component. Then the scheduler can switch back to FIFO if there are no tasks waiting to be scheduled (evaluated using a moving average). This option can be enabled/disabled by the user-space scheduler using the fifo_sched parameter in BpfScheduler: if set, the BPF component will periodically check for system utilization and switch back and forth to FIFO mode based on that. This allows to improve performance of workloads that are using a small amount of the available CPUs in the system, while still maintaining the same good level of performance for interactive tasks when the system is over commissioned. In certain video games, such as Baldur's Gate 3 or Counter-Strike 2, running in "normal" system conditions, we can experience a boost in fps of approximately 4-8% with this change applied. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-21 17:39:11 +02:00
I Hsin Cheng	e605b067c6	scx_flatcg: Correct content error in comment A's share in the hierarchy should be 100/(200+100); 200/(200+100) does not equal 1/3. Correct the mistake by changing "200" to "100".	2024-05-21 13:27:26 +08:00
Andrea Righi	a835ab0402	Merge pull request #299 from sched-ext/rustland-cleanups scx_rustland: cleanups	2024-05-20 18:50:30 +02:00
Tejun Heo	0181df54b5	Merge pull request #303 from sched-ext/simple_comment simple: Add comment explaining use of SHARED_DSQ	2024-05-20 06:45:13 -10:00
David Vernet	0dda4badd5	simple: Add comment explaining use of SHARED_DSQ scx_simple is a basic scheduler that does either basic vtime, or global FIFO, scheduling. At first glance, it may be confusing why we create a separate DSQ rather than just using SCX_DSQ_GLOBAL. Let's add a comment explaining the reason for this, so that users that are going over scx_simple as an example scheduler don't get confused. Signed-off-by: David Vernet <void@manifault.com>	2024-05-20 08:48:31 -05:00
Andrea Righi	9a2cc6be50	scx_rustland: report nr_running metric to stdout Report the amount of running tasks to stdout. This value also represents the amount of active CPUs that are currently executing a task. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-20 05:20:46 +02:00
Andrea Righi	aae4ed5b46	scx_rustland: fix coding style Small coding style changes found by rustfmt (no functional change). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-20 05:20:46 +02:00
Andrea Righi	c5a4a01994	scx_simple: re-add __COMPAT_scx_bpf_switch_all() Although newer kernels default to switching-all, some users might still be using the scheduler with older kernels. Therefore, ensure all tasks are moved to the SCHED_EXT class by calling __COMPAT_scx_bpf_switch_all() during init, so that scx_simple can still operate on these older kernels as well. Fixes: `cf66e58` ("Sync from kernel (670bdab6073)") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-20 04:50:16 +02:00
Andrea Righi	b1ab9c7418	scx_rustland: get rid of the dynamic slice boost The dynamic slice boost is not used anymore in the code, so there is no reason to keep evaluating it. Moreover, using it instead of the static slice boost seems to make things worse, so let's just get rid of it. Fixes: `0b3c399` ("scx_rustland: introduce dynamic slice boost") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-19 07:51:26 +02:00
David Vernet	17c0c10b4e	Merge pull request #294 from sched-ext/fix_warnings Fix warnings	2024-05-18 10:47:54 -05:00
Changwoo Min	4cba06dc33	scx_lavd: fix inconsistent indentation in main.bpf.c Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-18 22:22:16 +09:00
David Vernet	a1c60ce589	lavd: Remove unused variables from scx_lavd Fix unused variable warnings. Signed-off-by: David Vernet <void@manifault.com>	2024-05-18 07:51:20 -05:00
David Vernet	ee940bd8b5	rustland: Mark get_cpu_owner() as __maybe_unused scx_rustland has a function called get_cpu_owner() in BPF which currently has no callers. There's nothing wrong with the function, but it causes a warning due to an unused function. Let's just annotate it with __maybe_unused to tell the compiler that it's not a problem. Signed-off-by: David Vernet <void@manifault.com>	2024-05-18 07:51:20 -05:00
David Vernet	df42589a76	rusty: Fix bugs in rusty When building with warnings enabled, a few obvious bugs are pointed out: - We're not correctly calculating waker frequency - We're not taking the min of avg_run_raw compared to max latency - We're missing an element from sched_prio_to_weight Fix these. With these changes, interactivity is seemingly improved. We go from ~12 sec / turn -> 11 seconds / turn in the Civ 6 AI benchmark with a 4 x nproc CPU hogging workload in the background. It's clear, however, that we really need preemption. Signed-off-by: David Vernet <void@manifault.com>	2024-05-18 07:51:20 -05:00
David Vernet	61cbfdf912	layered: Remove unused variables There are some unused variables in scx_layered. Remove them. Signed-off-by: David Vernet <void@manifault.com>	2024-05-18 07:51:20 -05:00
David Vernet	b421cee59e	Merge pull request #291 from sched-ext/htejun/sync-kernel Sync from kernel (73f4013eb1eb)	2024-05-17 20:43:00 -05:00
Tejun Heo	ab25992416	Add missing skel.attach() calls C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling .attach() and were only attaching the struct_ops. This meant that all non-struct_ops BPF programs contained in the skels were never attached which breaks e.g. scx_layered. Let's fix it by adding an .attach() invocation to the attach macros.	2024-05-17 14:33:04 -10:00
Tejun Heo	e26fba9255	Sync from kernel (73f4013eb1eb) This pulls in the support for dump ops.	2024-05-17 01:57:36 -10:00
David Vernet	c1f1411c7a	Merge pull request #289 from sched-ext/rusty_hot_plug Add remaining hotplug pieces	2024-05-16 13:42:11 -06:00
Andrea Righi	42cee1c2dd	Merge pull request #286 from sched-ext/rustland-low-power-mode scx_rustland: introduce low power mode	2024-05-16 08:28:32 +02:00
I Hsin Cheng	6cce01c66b	Avoid redundant substraction in rsigmoid_u64 Originally the implementation of function rsigmoid_u64 performs a subtraction even when the value of "v" equals the value of "max", in which case the result is certainly zero. We can avoid this redundant subtraction by changing the condition from ">" to ">=", since when "v" and "max" are equal we can return 0 without any subtraction.	2024-05-16 11:58:39 +08:00
David Vernet	27d2490b1e	rusty: Use scx_ops_open!() in scx_rusty Now that the scx_ops_open!() macro is available, let's use it in scx_rusty to cover all cases of when hotplug can happen. Signed-off-by: David Vernet <void@manifault.com>	2024-05-15 16:42:59 -05:00
David Vernet	34818de54d	rusty: Use built-in exit code for restarting Now that the kernel exports the SCX_ECODE_ACT_RESTART exit code, we can remove the custom hotplug logic from scx_rusty, and instead rely on the built-in logic from the kernel. There's still a corner case that we're not honoring: when a hotplug event happens on the init path. A future change will address this as well. Signed-off-by: David Vernet <void@manifault.com>	2024-05-15 16:31:56 -05:00
Andrea Righi	e9ac6105c7	scx_rustland_core: introduce low-power mode Introduce a low-power mode to force the scheduler to operate in a very non-work conserving way, causing a significant saving in terms of power consumption, while still providing a good level of responsiveness in the system. This option can be enabled in scx_rustland via the --low_power / -l option. The idea is to not immediately re-kick a CPU when it enters an idle state, but do that only if there are no other tasks running in the system. In this way, latency-critical tasks can be still dispatched immediately on the other active CPUs, while CPU-bound tasks will be forced to spend more time waiting to be scheduled, basically enforcing a special CPU throttling mechanism that affects only the tasks that are not latency critical. The consequence is a reduction in the overall system throughput, but also a significant reduction of power consumption, that can be useful for mobile / battery-powered devices. Test case (using `scx_rustland -l`): - play a video game (Terraria) while recompiling the kernel - measure game performance (fps) and core power consumption (W) - compare the result of normal mode vs low-power mode Result (game performance, power consumption): normal mode: 60 fps, 6W; low-power mode: 60 fps, 3W. As we can see from the result the reduction of power consumption is quite significant (50%), while the responsiveness of the game (fps) remains the same, that means battery life can be potentially doubled without significantly affecting system responsiveness. The overall throughput of the system is, of course, affected in a negative way (kernel build is approximately 50% slower during this test), but the goal here is to save power while still maintaining a good level of responsiveness in the system. For this reason the low-power mode should be considered only in emergency conditions, for example when the system is close to completely run out of power or simply to extend the battery life of a mobile device without compromising its responsiveness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-15 20:32:05 +02:00
vax-r	f293995b59	Fix typo Fix the usage of "scheduler" in the comment of main.bpf.c; it should be a verb, which is "schedule".	2024-05-15 23:02:35 +08:00
Changwoo Min	08e7e23cbe	scx_lavd: priint out the current limitaiton of scx_lavd for users Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-15 12:04:09 +09:00
Changwoo Min	a4560c7f7f	scx_lavd: add comments describing the idea of preemption Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-15 12:04:03 +09:00
Andrea Righi	2a7b1cc3c4	scx_rustland: properly support offline CPUs During the initialization phase the scheduler needs to be aware of all the available CPUs in the system (also those that are offline), in order to create a proper per-CPU DSQ for all of them. Otherwise, if some cores are offline, we may get errors like the following: swapper/7[0] triggered exit kind 1024: runtime error (invalid DSQ ID 0x0000000000000007) Backtrace: scx_bpf_consume+0xaa/0xd0 bpf_prog_42ff1b9d1ac5b184_rustland_dispatch+0x12b/0x187 Change the code to configure the BpfScheduler object with the total amount of CPUs available in the system and prevent such failure. This fixes #280. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-12 08:42:46 +02:00
Andrea Righi	a31bcc6847	scx_rustland: maximize CPU utilization Always dispatch at least one task, even if all the CPUs are busy. This small overcommitment makes it possible to maximize CPU utilization without introducing bubbles in the scheduling and also without introducing regressions in terms of responsiveness. Before this change the average CPU utilization of a `stress-ng -c 8` on an 8-core system is around 95%. With this change applied the CPU utilization goes up to a consistent 100%. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-11 16:23:12 +02:00
Andrea Righi	63feba9c2b	topology: TopologyMap: add nr_cpus_online() Add a method to TopologyMap to get the amount of online CPUs. Considering that most of the schedulers are not handling CPU hotplugging it can be useful to expose also this metric in addition to the amount of available CPUs in the system. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-10 17:24:20 +02:00
Andrea Righi	f052493005	scx_rustland_core: implement effective time slice on a per-task basis Drop the global effective time-slice and use the more fine-grained per-task time-slice to implement the dynamic time-slice capability. This allows to reduce the scheduler's overhead (dropping the global time slice volatile variable shared between user-space and BPF) and it provides a more fine-grained control on the per-task time slice. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-10 17:24:20 +02:00
Changwoo Min	01faf9408b	Merge pull request #274 from multics69/scx-lavd-preemption02 scx_lavd: support yield-based preemption	2024-05-10 11:32:29 +09:00
Changwoo Min	446de3ef3c	scdx_lavd: minor style changes Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-10 11:07:32 +09:00
Changwoo Min	7fcc6e4576	scx_lavd: support yield-based preemption If there is a higher priority task when running ops.tick(), ops.select_cpu(), and ops.enqueue() callbacks, the currently running task yields its CPU by shrinking its time slice to zero so that a higher priority task can run on the current CPU. As low-cost, fine-grained preemption becomes available, default parameters are adjusted as follows: - Raise the bar for remote CPU preemption to avoid IPIs. - Increase the maximum time slice. - Gradually enforce the fair use of CPU time (i.e., ineligible duration) Lastly, using CAS, we ensure that a remote CPU is preempted by only one CPU. This removes unnecessary remote preemptions (and IPIs). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-10 00:54:41 +09:00
Andrea Righi	7bc62d8db8	Merge pull request #270 from sched-ext/rustland-user-ringbuffer scx_rustland_core: use a BPF_MAP_TYPE_USER_RINGBUF to dispatch tasks	2024-05-09 06:50:19 +02:00
vax-r	093a08356e	Fix typo Fix "expermentation" to "experimentation".	2024-05-09 12:10:55 +08:00
Andrea Righi	5da4602ad7	scx_rustland_core: use a BPF_MAP_TYPE_USER_RINGBUF to dispatch tasks Replace the BPF_MAP_TYPE_QUEUE with a BPF_MAP_TYPE_USER_RINGBUF to store the tasks dispatched from the user-space scheduler to the BPF component. This eliminates the need for the bpf() syscalls, significantly reducing the overhead of the user-space->kernel communication and delivering a notable performance boost in the overall system throughput. Based on experimental results, this change reduces the scheduling overhead by approximately 30-35% when the system is overcommitted. This improvement has the potential to make user-space schedulers based on scx_rustland_core viable options for real production systems. Link: https://github.com/libbpf/libbpf-rs/pull/776 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-08 22:16:53 +02:00
David Vernet	b9b9875aa7	rusty: Remove task offline tracking scx_rusty's intention is to support hotplug by automatically restarting whenever a hotplug event is encountered. Now that we're not trying to consume a bogus DSQ in the rusty_dispatch() on a newly hotplugged CPU, let's just remove offline tracking. It's really just there as a sanity check, but it triggers if an offline task is made runnable during a hotplug event before the ops.hotplug() callback has been invoked. Signed-off-by: David Vernet <void@manifault.com>	2024-05-04 21:33:55 -05:00
David Vernet	6f1dc6067a	rusty: Check for offline CPU in rusty_dispatch() There's currently a slight issue on existing kernels on the hotplug path wherein we can start to receive scheduling callbacks on a CPU before that CPU has received hotplug events. For CPUs going online, this can possibly confuse a scheduler because it may not be expecting anything to ever happen on that CPU, and therefore may do things that could cause the scheduler to crash. For example, without this patch in scx_rusty, we try to consume from a bogus DSQ that doesn't exist, which causes ext.c to boot out the scheduler. Though this issue will soon be fixed in ext.c, let's explicitly avoid dispatching from an onlining CPU in rusty so that we properly support hotplug on older kernels as well. Signed-off-by: David Vernet <void@manifault.com>	2024-05-04 21:33:54 -05:00
David Vernet	0d6b00238f	common: Add likely/unlikely macros We can hint to the compiler about paths we'll take in a scheduler. This is a common pattern, so lets provide convenience macros. Signed-off-by: David Vernet <void@manifault.com>	2024-05-04 21:33:53 -05:00
David Vernet	4b16f5117a	rusty: Fix alignment Found a misaligned conditional in main.rs. Fix it. Signed-off-by: David Vernet <void@manifault.com>	2024-05-04 21:33:19 -05:00
Changwoo Min	01e5a46371	Merge pull request #263 from multics69/scx_lavd-power01 scx_lavd: support CPU frequency scaling	2024-05-05 10:16:00 +09:00
Changwoo Min	a24e1d7adf	scx_lavd: more comments about CPU frequency scaling Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-04 10:41:13 +09:00
David Vernet	9bb8e9a548	common: Pull bpf_log2l() into helper function header scx_lavd implemented 32 and 64 bit versions of a base-2 logarithm function. This is now also used in rusty. To avoid code duplication, let's pull it into a shared header. Note that there is technically a functional change here as we remove the always inline compiler directive. We instead assume that the compiler will know best whether or not to inline the function. Signed-off-by: David Vernet <void@manifault.com>	2024-05-03 14:50:24 -05:00
David Vernet	2403f60631	rusty: Dynamically scale slice according to system util In user space in rusty, the tuner detects system utilization, and uses it to inform how we do load balancing, our greedy / direct cpumasks, etc. Something else we could be doing but currently aren't, is using system utilization to inform how we dispatch tasks. We currently have a static, unchanging slice length for the runtime of the program, but this is inefficient for all scenarios. Giving a task a long slice length does have advantages, such as decreasing the number of involuntary context switches, decreasing the overhead of preemption by doing it less frequently, possibly getting better cache locality due to a task running on a CPU for a longer amount of time, etc. On the other hand, long slices can be problematic as well. When a system is highly utilized, a CPU-hogging task running for too long can harm interactive tasks. When the system is under-utilized, those interactive tasks can likely find an idle, or under-utilized core to run on. When the system is over-utilized, however, they're likely to have to park in a runqueue. Thus, in order to better accommodate such scenarios, this patch implements a rudimentary slice scaling mechanism in scx_rusty. Rather than having one global, static slice length, we instead have a dynamic, global slice length that can be changed depending on system utilization. When over-utilized, we go with a longer slice length, and vice versa for when the system is under-utilized. With Terraria, this results in roughly a 50% improvement in mean FPS when playing on an AMD Ryzen 9 7950X, while running Spotify, and stress-ng -c $((4 * $(nproc))). Signed-off-by: David Vernet <void@manifault.com>	2024-05-03 14:17:58 -05:00
David Vernet	76618989f8	rusty: Implement basic eligible deadline scheduling in rusty scx_rusty doesn't do terribly well with interactive workloads. In order to improve the situation, this patch adds support for basic deadline scheduling in rusty. This approach doesn't incorporate eligibility, and simply uses a crude avg_runtime tracking approach to scaling a task's deadline. In a series of follow-on changes, we'll update the scheduler to use more indicators for interactivity that affect both slice length, and deadline calculation. Signed-off-by: David Vernet <void@manifault.com>	2024-05-03 14:17:56 -05:00
Changwoo Min	6892898469	scx_lavd: support CPU frequency scaling To know the required CPU performance (e.g., frequency) demand, we keep track of 1) utilization of each CPU and 2) _performance criticality_ of each task. The performance criticality of a task denotes how critical it is to CPU performance (frequency). Like the notion of latency criticality, we use three factors: the task's average runtime, wake-up frequency, and waken-up frequency. The longer a task's runtime and the higher its two frequencies, the more performance-critical the task is, because it would be a bottleneck in the middle of the task chain. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-05-04 00:30:25 +09:00
David Vernet	925a69b156	rusty: Use helper to lookup domain context Let's remove the extraneous copy pasting and use a lookup helper like we do for task and pcpu context. Signed-off-by: David Vernet <void@manifault.com>	2024-05-02 13:56:46 -05:00
Daniel Jordan	de2773d621	scx_rusty: compare abs values in xfer_between() A LoadEntity gets the load to transfer between two entities by taking the minimum of their imbalances and reducing its abs value by xfer_ratio. In practice self.imbal(), the push node or domain, always has positive imbalance and other.imbal(), the pull node or domain, always has negative imbalance, so other.imbal() is always the minimum even though the abs value of its imbalance might be greater than the abs value of self.imbal(). It seems like the intent is to take the minimum of the two absolute values instead to avoid overbalancing at the puller, so make both values abs. Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>	2024-05-02 11:54:13 -04:00
Daniel Jordan	1652791e5d	scx_rusty: make per-task loads sensitive to lb_apply_weight Rusty's load balancer calculates load differently based on average system CPU utilization in create_domain_hierarchy(). At >= 99.999% utilization, load is the product of a task's weight and duty cycle; below that, load is the same as the task's duty cycle. populate_tasks_by_load(), however, always uses the product when calculating per-task load so that in the sub-99.999% util case, load is inflated, typically by a factor of 100 with a normal priority task. Tasks look too heavy to migrate as a result because a single task would transfer more load than the domain imbalance allows, leading to significant imbalance in some cases. Make populate_tasks_by_load() calculate task load the same way as domain load, checking lb_apply_weight. Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>	2024-05-02 11:54:05 -04:00
Andrea Righi	11f100f043	scx_rustland: bump up version to 0.0.6 Bump up scx_rustland version to use the new scx_rustland_core crate. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-30 18:32:21 +02:00
Andrea Righi	fd68ce13a7	scx_rustland_core: bump up version to 0.4.0 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-30 18:09:09 +02:00
Tejun Heo	c77d101655	scheds/c: Sync to the new conventions Sync with the in-kernel-tree example schedulers.	2024-04-29 10:13:46 -10:00
Tejun Heo	71d5e60093	scheds/rust: Use __COMPAT helpers instead of open coding feature tests	2024-04-29 09:58:34 -10:00
Tejun Heo	cf66e58118	Sync from kernel (670bdab6073) And fix build breakage in scx_utils due to an enum type rename.	2024-04-29 09:58:19 -10:00
Tejun Heo	e5e88b7e18	Bump versions to prepare for a release	2024-04-29 09:07:27 -10:00
Tejun Heo	3e7ef35649	Merge pull request #250 from multics69/lavd-issue-234 scx_lavd: replesih time slice at ops.running() only when necessary	2024-04-29 09:01:04 -10:00
Tejun Heo	5b7b7d5193	Merge pull request #247 from multics69/lavd-issue-244 scx_lavd: always inline submit_task_ctx to make the verifier happy	2024-04-29 07:53:38 -10:00
Changwoo Min	5f63e0ca30	scx_lavd: replesih time slice at ops.running() only when necessary The current code replenishes the task's time slice whenever the task becomes ops.running(). However, there is a case where such behavior can starve the other tasks, causing the watchdog timeout error. One (if not all) such case is when a task is preempted while running by the higher scheduler class (e.g., RT, DL). In such a case, the task will transit through a cycle of ops.running() -> ops.stopping() -> ops.running() -> etc. Whenever it runs again, it will be placed at the head of the local DSQ and ops.running() will renew its time slice. Hence, in the worst case, the task can run forever since its time slice is never exhausted. The fix is assigning the time slice only once by checking if the time slice is calculated before. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-29 12:13:31 +09:00
Andrea Righi	cabde30736	scx_utils: bump up version to 0.8.0 Bump up scx-utils version to provide the new scx_utils::TopologyMap. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 21:01:16 +02:00
Andrea Righi	5effb4fc4c	scx_rustland: bump up version to 0.0.5 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	0785246ee2	scx_rustland: provide --version option Provide a command line option to print the version of the scheduler and the scx_rustland_core crate. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	fb2f5c240e	scx_rustland_core: bump up version to 0.3 Given that rustland_core now supports task preemption and it has been tested successfully, it's worthwhile to cut a new version of the crate. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	905960f752	scx_lavd: use c_char consistently In Rust c_char can be aliased to i8 or u8, depending on the particular target architecture. For example, trying to build scx_lavd on ppc64 triggers the following error: error[E0308]: mismatched types --> src/main.rs:200:38 \| 200 \| let c_tx_cm: *const c_char = (&tx.comm as *const [i8; 17]) as *const i8; \| ------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*const u8`, found `*const i8` \| \| \| expected due to this \| = note: expected raw pointer `*const u8` found raw pointer `*const i8` To fix this, consistently use c_char instead of assuming it corresponds to i8. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-27 17:21:19 +02:00
Changwoo Min	f470b1aa13	scx_lavd: always inline submit_task_ctx to make the verifier happy In _some_ kernel versions, loading scx_lavd fails with an error of "bpf_rcu_read_unlock is missing". The usage of bpf_rcu_read_lock/unlock() in proc_dump_all_tasks() is correct but the bpf verifier still thinks bpf_rcu_read_unlock() is missing. The most plausible reason so far is that the problematic kernel does not have commit 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog calls"), failing inter-procedural analysis between proc_dump_all_tasks() and submit_task_ctx(). Thus, we force inline submit_task_ctx() (no inter-procedural analysis by the verifier is necessary) for the time being. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-28 00:11:38 +09:00
Changwoo Min	d0d0a18b10	scx_lavd: fix copyright information Correct the copyright and author information Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-26 16:36:58 +09:00
Andrea Righi	973aded5a8	Merge pull request #238 from sched-ext/rustland-reduce-topology-overhead scx_rustland: reduce overhead by caching host topology	2024-04-24 22:24:23 +02:00
David Vernet	5ba137e8c9	layered: Make layered backwards compat with cpufreq Only the very newest kernels support scx_bpf_cpuperf_set(). Let's update scx_layered to accommodate older kernels as well. Signed-off-by: David Vernet <void@manifault.com>	2024-04-24 14:01:51 -05:00
Tejun Heo	9a9b4dd23e	Merge pull request #239 from hodgesds/cpufreq_helpers Add CPU frequency related helpers and extend scx_layered	2024-04-24 07:22:15 -10:00
Andrea Righi	5302ff1cdc	scx_rustland: use TopologyMap for efficient CPU topology iteration Looking at perf top it seems that the scheduler can spend a significant amount of time iterating over the CPU topology/cpumask information, especially when the system is running a significant amount of tasks: 2.57% scx_rustland [.] <scx_utils::cpumask::CpumaskIntoIterator as core::iter::traits::iterator::Iterator>::next Considering that scx_rustland doesn't support CPU hotplugging yet (it requires a full restart to properly handle CPU hotplug events), we can completely avoid this overhead by caching a TopologyMap object at the beginning, when the scheduler starts, instead of constantly re-evaluating the CPU topology information. This allows to reduce the scheduler overhead by ~5% CPU utilization under heavy load conditions (from ~65% -> ~60%, according to top). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-24 17:08:06 +02:00
Daniel Hodges	32e97bf4d5	Adds CPU frequency related helpers and extend scx_layered This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and `scx_bpf_cpuperf_set` definitions that were recently introduced into [`sched_ext`](https://github.com/sched-ext/sched_ext/pull/180). It adds a `perf` field to `scx_layered` to allow for controlling performance per layer. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-04-24 07:27:52 -07:00
David Vernet	a8daf372b2	Merge pull request #241 from sched-ext/cpumask_efficient topology: Don't allocate on calls to span()	2024-04-24 09:21:15 -05:00
David Vernet	24c248eebb	layered: Add support for filtering on process name If a library creates threads, those threads will often have the same name. If two different processes of different priority both use a library, it may be that we want the library's threads in each process to be put into different layers. To support this, let's add the ability to filter not only by task name, but also by process name via the task thread group leader's comm. Tested by creating two executables named "foo" and "bar", which both spawn a bunch of tasks named "exp_worker" that spin until being interrupted. With this config: https://pastebin.com/Uz2phzxQ, the tasks were correctly matched to the expected layers. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 23:12:37 -05:00
David Vernet	c187c65702	topology: Don't allocate on calls to span() We're currently cloning cpumasks returned by calls to {Core, Cache, Node, Topology}::span(). If a caller needs to clone it, they can. Let's not penalize the callers that just want to query the underlying cpumask. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 22:59:42 -05:00
David Vernet	a998fb7d01	layered: Clarify f: and file: prefix behavior Some people have expressed confusion at this behavior. Let's be a bit more explicit in the documentation. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 20:39:28 -05:00
Andrea Righi	fbe9a80af8	scx_rustland: introduce --no-preemption Provide a run-time option to disable task preemption. This option can be used to improve the throughput of the CPU-intensive tasks while still providing a good level of responsiveness in the system. By default preemption is enabled, to provide a higher level of responsiveness to the interactive tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
Andrea Righi	0ffaaac6db	scx_rustland: enable preemption Use the new scx_rustland_core dispatch flag RL_PREEMPT_CPU to allow interactive tasks to preempt other tasks with scx_rustland. If the built-in idle selection logic is enforced (option `-i`), the scheduler prioritizes keeping tasks on the target CPU designated by this logic. With preemption enabled, these tasks have a higher likelihood of reusing their cached working set, potentially improving performance. Alternatively, when tasks are dispatched to the first available CPU (default behavior), interactive tasks benefit from running more promptly by kicking out other tasks before their assigned time slice expires. This potentially allows to increase the default time slice to higher values in the future, to improve the overall throughput in the system and, at the same time, still maintain a good level of responsiveness, because interactive tasks are now able to run pretty much immediately, independently on the remaining time slice of the other tasks that are contending the CPUs in the system. = Results = Measuring the performance of the usual benchmark "playing a video game while running a parallel kernel build in background" seems to give around 2-10% boost in the fps with preemption enabled, depending on the particular video game. Results were obtained running a `make -j32` kernel build on a AMD Ryzen 7 5800X 8-Cores 16GB RAM, while testing video games such as Baldur's Gate 3 (with a solid +10% fps), Counter Strike 2 (around +5%) and Team Fortress 2 (+2% boost). Moreover, some WebGL applications (such as https://webglsamples.org/aquarium/aquarium.html) seem to benefit even more with preemption enabled, providing up to a +15% fps boost. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
Andrea Righi	6d2aac1591	scx_rustland_core: introduce dispatch flags Reserve some bits of the `cpu` attribute of a task to store special dispatch flags. Initially, let's introduce just RL_CPU_ANY to replace the special value NO_CPU, indicating that the task can be dispatched on any CPU, specifically the first CPU that becomes available. This allows to keep the CPU value assigned by the builtin idle selection logic, that can potentially be used later for further optimizations. Moreover, having the possibility to specify dispatch flags gives more flexibility and it allows to map new scheduling features to such flags. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
takase1121	3e12676ca2	scheds-rust: add explanation for chaining schedulers	2024-04-23 08:30:38 +08:00
takase1121	5d20f89a87	scheds-rust: build rust schedulers in sequence	2024-04-23 08:06:27 +08:00
David Vernet	5f1eac85ff	layered: Fix init_task When I transitioned layered to using task local storage, I messed up initializing the task ctx, not realizing we previously had a separate variable that was initializing the hashmap entry. We need to initialize the task's layer to -11, and also set refresh_layer to 1. Signed-off-by: David Vernet <void@manifault.com>	2024-04-18 09:44:32 -05:00
David Vernet	45589cd0f7	lavd: Fix a few typos Noticed a few typos. Let's fix em up Signed-off-by: David Vernet <void@manifault.com>	2024-04-17 08:17:52 -05:00
David Vernet	eed338ef25	simple: Invoke __COMPAT_scx_bpf_switch_all(); scx_simple no longer supports running in "partial" mode, with only certain tasks using scx_simple. When this option was removed, we also removed the call to scx_bpf_switch_all(); While switching-all is the default behavior for newer kernels, let's add __COMPAT_scx_bpf_switch_all() so that scx_simple can work on older kernels as well. Signed-off-by: David Vernet <void@manifault.com>	2024-04-16 11:09:44 -05:00
David Vernet	ffced1f615	rusty: Remove explicit padding As of libbpf-rs 0.23.0 (which contains commit `9d9e979fcf`), libbpf-rs now generates rust structs that honor padding. We can therefore remove the custom padding in scx_rusty's struct pcpu_ctx. For example, here is the generated pub struct pcpu_ctx: pub struct pcpu_ctx { pub dom_rr_cur: u32, pub dom_id: u32, pub nr_node_doms: u32, pub node_doms: [u32; 64], pub __pad_268: [u8; 52], } And here is the matching struct in the BPF object file: struct pcpu_ctx { u32 dom_rr_cur; /* 0 4 */ u32 dom_id; /* 4 4 */ u32 nr_node_doms; /* 8 4 */ u32 node_doms[64]; /* 12 256 */ /* size: 320, cachelines: 5, members: 4 */ /* padding: 52 */ } __attribute__((__aligned__(64))); Signed-off-by: David Vernet <void@manifault.com>	2024-04-12 13:52:13 -05:00
David Vernet	e032ee7cc0	rusty: Add lookup_pcpu_ctx() helper Getting rid of more boilerplate Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:23 -05:00
David Vernet	885a9fd7da	rusty: Make lookup_task_ctx() static It doesn't need to be a global prog. Let's make it static. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:23 -05:00
David Vernet	0ff73754cf	rusty: Add create_save_cpumask() helper We have a lot of boilerplate code where we create a cpumask, initialize it, and then bpf_kptr_xchg() it into the map. In an effort to slightly reduce the amount of boilerplate, let's create a helper that can alleviate some of it. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:21 -05:00
David Vernet	e27d5b4e67	rusty: Fix a few random issues There are some random issues in the code, like unused variables, and bad print formatters. I'm not sure why the compiler isn't consistently complaining, but let's fix them. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:21:02 -05:00
David Vernet	31cc2dccb9	rusty: Allocate DSQ on appropriate NUMA node In scx_rusty, now that we have a complete view of the host's topology thanks to the Topology crate, we can update our calls to scx_bpf_create_dsq() to create the DSQ on the NUMA node of the domain. It's unclear how much this will end up mattering for performance in the typical case, but we might as well do the right thing given that host topology is static, and we have the information. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 00:01:25 -05:00
Dan Schatzberg	6eefc8c27f	Fix error typo ENONET means "Machine is not on the network" - this was supposed to be ENOENT "No such file or directory"	2024-04-10 15:28:05 -04:00
Changwoo Min	f53c29759e	scx_lavd: support preemption (in some scenarios) (#224) * scx-lavd: preemption of a lower-priority task using kick cpu When a task is enqueued to the global queue, the scheduler checks if there is a lower priority task than the enqueued task. If so, it kicks out the lower-priority task, hoping the newly enqueued task or another higher-priority task runs on the kicked CPU. Kicking another CPU is expensive as an IPI is involved, so the scheduler judiciously kicks the CPU when its benefit (i.e., priority gap) is clear enough. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-09 14:25:53 +09:00
David Vernet	9a8ed8ab44	Merge pull request #218 from sched-ext/rusty_hotplug Gracefully handle hotplug in scx_rusty	2024-04-04 16:03:59 -05:00
Andrea Righi	17a30bddc9	scx_rustland_core: bump up version to 0.2 Bump up the version of the crate and update dependencies. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-04 22:44:55 +02:00
David Vernet	622b61dd2f	rusty: Support restarting rusty on hotplug events The scx_rusty scheduler does not support hotplug, and expects a static host topology throughout its runtime. Though the kernel does have support for detecting hotplug events, we currently don't detect this in the kernel, nor surface it to user space when it happens. Now that we have scx_bpf_exit(), we can gracefully exit the kernel in the event of a hotplug, and communicate to user space that it should restart the scheduler. This patch adds that support to scx_rusty. Note that this assumes that we're running on a recent enough kernel that has scx_bpf_exit(). If it doesn't, then we instead just error out of the kernel scheduler and exit the application. Signed-off-by: David Vernet <void@manifault.com>	2024-04-04 14:52:48 -05:00
Tejun Heo	ba52cc131b	scx_lavd: Add .gitignore	2024-04-04 07:15:37 -10:00
Andrea Righi	eca7ecd24e	build: introduce kernel_headers build option If we try to cross-build scx on builders with older versions of system's linux headers (such as those provided by linux-libc-headers in older releases of Ubuntu), we may hit build failures, due to the different kernel ABI, such as: error: invalid use of undefined type ‘struct btf_enum64’ To address this, introduce a new build option called "kernel_headers" that allows to specify a custom path for the kernel headers required during the build process. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-04 10:53:36 +02:00
Tejun Heo	a60737a6bf	Merge pull request #207 from sched-ext/api-updates scx: Apply API updates from sched_ext	2024-04-02 14:26:42 -10:00
Tejun Heo	348fe53256	Sync from kernel Synchronize stragglers. - Bug fix in __COMPAT_read_enum(). - A cosmetic difference in scx_qmap.bpf.c. - Stray 'p' when calling getopt() in scx_simple.c. After this the kernel tree and scx repo are in sync.	2024-04-02 11:29:50 -10:00
Tejun Heo	b925bdf94d	Cargo.toml: Update libbpf-rs/cargo dependencies to 0.23 and drop patch.crates-io sections New versions of libbpf-rs and libbpf-cargo are now available with all the needed features. Update the dependencies and drop the patch sections.	2024-04-02 11:19:39 -10:00
Tejun Heo	6f81409df4	Bump versions - scx_utils bumped from 0.6.0 to 0.7.0. - Repo and rust schedulers get a PATCH level bump.	2024-04-02 10:58:50 -10:00
Tejun Heo	f3e20ae9b3	scx_rustland: Apply API updates and add --exit-dump-len option to scx_rustland	2024-04-02 10:30:56 -10:00
David Vernet	5088328f9e	rusty: Check LOCAL_DSQ length for WAKE_SYNC In rusty_select_cpu(), if a task is WAKE_SYNC, we'll currently migrate the task to that CPU if there are any idle cores on the system. As in [0], this condition is insufficient, as there could be idle cores elsewhere on the system, but still tasks piled up on a single local DSQ. Let's add a condition that the local DSQ has to be empty in order to apply the WAKE_SYNC migration. Before patch: [void@maniforge src]$ hackbench Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks) Each sender will pass 100 messages of 100 bytes Time: 0.433 With patch: [void@maniforge src]$ hackbench Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks) Each sender will pass 100 messages of 100 bytes Time: 0.035 Signed-off-by: David Vernet <void@manifault.com>	2024-04-02 15:17:32 -05:00
Tejun Heo	06fdae177f	vmlinux: Update to 5dc95302301fb7e51cd4d218008b9dad10110069	2024-04-02 10:08:18 -10:00
Tejun Heo	98e586ce63	vmlinux: Drop unused old vmlinux headers No need to keep them around.	2024-04-02 10:08:09 -10:00
Tejun Heo	dfa978d166	scx_lavd: Apply API updates	2024-04-02 10:08:02 -10:00
Tejun Heo	0c07f382b1	scx_rusty: Apply API updates	2024-04-02 10:07:54 -10:00
Tejun Heo	59bbd800c1	compat: Implement scx_utils::compat and fix up scx_layered Implement scx_utils::compat to match C's scx/compat.h and update scx_layered. Other rust scheds are still broken.	2024-04-02 07:08:56 -10:00


759 Commits