JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-11-26 03:20:24 +00:00

Author	SHA1	Message	Date
Daniel Hodges	4aa841de0a	scx_layered: Rename HI_FALLBACK_DSQ to HI_FALLBACK_DSQ_BASE Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-20 17:28:38 -04:00
Daniel Hodges	a3d1344293	scx_layered: Add core growth algo for core type Add core growth algos for Big/Little core support. The algos allow layers to grow layers by preferring either big or little cores first. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-20 11:50:15 -04:00
I Hsin Cheng	7799b94f07	scx_layered: Add helper function to access cpumask within bpf_cpumask Before passing "nodec->cpumask" and "cachec->cpumask" into "bpf_cpumask_test_cpu()", type conversion should be done first. Implement "cast_mask()" to convert "struct bpf_cpumask *" into "const struct cpumask *". Reference from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_common.h#n63 Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-09-20 20:52:03 +08:00
I Hsin Cheng	5596d5e3fe	scx_bpfland: Remove the usage of cast_mask in bpfland_enqueue The usage of cast_mask() within bpfland_enqueue aims to cast the type of "p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *". However, the type of "p->cpus_ptr" is already "const cpumask_t *" aka "const struct cpumask *", so no conversion is needed. Passing a value of type "struct cpumask *" into "struct bpf_cpumask *" also leads to a compile error. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-09-20 20:45:09 +08:00
Daniel Hodges	8532ba3f1e	scx_layered: Fix hi fallback dsq consumption Fix hi fallback dsq consumption to only consume from the cache local hi fallback dsq. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-20 04:18:05 -04:00
I Hsin Cheng	e4bb99efc5	scx_layered: Refactor match_layer() Refactor match_layer() to prevent the compile error caused by using the variable "nr_match_ors" before it is initialized. Move the check of "nr_match_ors" after it reads the value from "layer->nr_match_ors" to make sure it is initialized successfully. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-09-19 22:20:03 +08:00
Andrea Righi	3f8db5783b	Merge pull request #658 from sched-ext/rustland-core-improve-cpu-selection scx_rustland_core: improve idle CPU selection API and logic	2024-09-17 22:38:15 +02:00
Andrea Righi	e6b624a97c	scx_rustland_core: improve idle CPU selection API and logic Pass enqueue flags to user-space: flags will be passed via QueuedTask.flags and can be forwarded back to BPF via DispatchedTask.flags. These flags can also be passed to BpfScheduler.select_cpu() to apply a more refined CPU selection policy. Moreover, avoid prioritizing the user-space scheduler too much and dispatch it only if there are no other tasks that need to be dispatched in ops.dispatch(). This improves CPU utilization and enhances the fairness, robustness, and resilience of schedulers based on scx_rustland_core, particularly under stress test conditions. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-16 22:12:38 +02:00
Daniel Hodges	4f98de333d	Merge pull request #652 from JakeHillion/layer-growth-rr scx_layered: add round robin growth strategy	2024-09-16 17:34:48 +02:00
Andrea Righi	00eebaf905	scx_bpfland: refine task wakeup logic On WAKE_SYNC attempt to migrate the wakee on the same CPU as the waker if the waker is not exiting, the wakee can use the waker's CPU, the waker's L3 domain is not saturated and there are not other tasks queued to the local DSQ of the waker's CPU. This is the same logic used in scx_rusty. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-15 14:50:14 +02:00
Andrea Righi	079a53c689	scx_bpfland: get rid of preferred domain Using the turbo boosted CPUs as preferred scheduling seems to be beneficial only in a very few corner cases, for example on battery-powered devices with an aggressive cpufreq governor that constantly tries to scale down the frequency (and even in this case it's probably better to not force the tasks to run on the fast CPUs, to save power). In practice the preferred domain seems to introduce more overhead than benefits overall, so let's get rid of it. This can be improved in the future by adding multiple user-configurable scheduling domains. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-15 14:50:14 +02:00
Changwoo Min	95e2f4dabe	scx_lavd: boost the latency criticality of kernel threads Many kernel threads perform latency-critical tasks (e.g., net, gpu). In particular, the AMD GPU driver runs mostly in kernel space using kworker. Hence, treat kernel threads as if they were woken-up tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-14 00:41:02 +09:00
Changwoo Min	4b4f42fce1	scx_lavd: add a short circuit for the case of no turbo core Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-13 16:02:07 +09:00
Jake Hillion	3848d87895	scx_layered: add round robin growth strategy	2024-09-12 23:27:21 +01:00
Daniel Hodges	632fcfe4ae	Merge pull request #648 from hodgesds/layered-llc-stats scx_layered: Add stats for XNUMA/XLLC migrations	2024-09-12 13:23:23 -04:00
Daniel Hodges	dde6e0c7f9	scx_utils: Add node/llc id to core topology Add ids for node/llc in the Core topology struct.	2024-09-12 10:05:02 -07:00
Daniel Hodges	aee19dd9a1	scx_layered: Add topology aware core growth selection Add topology aware core growth selection. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-12 06:48:51 -07:00
Daniel Hodges	14a19dc3ca	scx_layered: Add random layer growth algo Add a random layer growth algo. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-12 05:35:54 -07:00
Daniel Hodges	ae57f8d1f9	scx_rusty: Initialize node cpumask Initialize the node cpumask, which was previously uninitialized causing metric calculations to be wrong when attempting to lookup CPUs in the node cpumask. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-11 13:14:44 -07:00
Jake Hillion	8ca45cfa37	lint: enable cargo fmt (#643) Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it. Test plan: - CI lint check passes.	2024-09-11 10:03:20 +01:00
Daniel Hodges	43ec8bfe82	scx_layered: Add stats for XNUMA/XLLC migrations Add stats for XNUMA/XLLC migrations. An example of the output is shown: ``` hodgesd : util/frac= 5.4/ 0.1 load/frac= 301.0/ 0.3 tasks= 476 tot= 3168 local=97.82 wake/exp/reenq= 2.18/ 0.00/ 0.00 keep/max/busy= 0.03/ 0.00/ 0.03 kick= 0.00 yield/ign= 0.09/ 0 open_idle= 0.00 mig= 6.82 xnuma_mig= 6.82 xllc_mig= 4.86 affn_viol= 0.00 preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/ 0.00ms cpus= 2 [ 2, 4] 00000000 00000010 00001000 normal : util/frac= 28.7/ 0.7 load/frac= 101704.7/ 95.8 tasks= 2450 tot= 4660 local=99.06 wake/exp/reenq= 0.88/ 0.06/ 0.00 keep/max/busy= 1.03/ 0.00/ 0.00 kick= 0.06 yield/ign= 0.04/ 400 open_idle=15.73 mig=23.45 xnuma_mig=23.45 xllc_mig= 3.07 affn_viol= 0.00 preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.88 min_exec= 0.00/ 0.00ms cpus= 2 [ 2, 2] 00000001 00000100 00000000 excl_coll=12.55 excl_preempt= 0.00 random : util/frac= 0.0/ 0.0 load/frac= 0.0/ 0.0 tasks= 0 tot= 0 local= 0.00 wake/exp/reenq= 0.00/ 0.00/ 0.00 keep/max/busy= 0.00/ 0.00/ 0.00 kick= 0.00 yield/ign= 0.00/ 0 open_idle= 0.00 mig= 0.00 xnuma_mig= 0.00 xllc_mig= 0.00 affn_viol= 0.00 preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/ 0.00ms cpus= 0 [ 0, 0] 00000000 00000000 00000000 excl_coll= 0.00 excl_preempt= 0.00 stress-ng: util/frac= 4189.1/ 99.2 load/frac= 4200.0/ 4.0 tasks= 43 tot= 62 local= 0.00 wake/exp/reenq= 0.00/100.0/ 0.00 keep/max/busy=2433.9/177.4/ 0.00 kick=100.0 yield/ign= 3.23/ 0 open_idle= 0.00 mig=54.84 xnuma_mig=54.84 xllc_mig=35.48 affn_viol= 0.00 preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/ 0.00ms cpus= 4 [ 4, 4] 00000300 00030000 00000000 excl_coll= 0.00 excl_preempt= 0.00 ``` Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-10 19:53:28 -07:00
Tejun Heo	8f0cc89ee8	Merge pull request #645 from frelon/rusty-init-dom scx_rusty: init domains when calculating averages	2024-09-10 12:25:51 -10:00
Andrea Righi	e6e3579a92	Merge pull request #634 from anh0516/main scx_bpfland: Documentation consistency fix	2024-09-10 23:25:55 +02:00
Fredrik Lönnegren	f155966b77	scx_rusty: init domains when calculating averages The domains are added to the aggregator when load is added (and duty_cycle is not 0.0f64). This commit makes sure that all domains are added to the aggregator even when the calculated duty_cycle is 0. Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>	2024-09-10 21:51:41 +02:00
likewhatevs	85863d0e1c	Merge pull request #644 from hodgesds/layered-topo-order scx_layered: Pass layer spec for core growth algo	2024-09-10 14:49:37 -04:00
Daniel Hodges	5fdd257862	scx_layered: Pass layer spec for core growth algo Pass in the layer spec when determining the layer core growth algo. This should make it easier to implement layer growth algos that are spec specific. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-10 10:27:08 -07:00
Samuel Nair	c6af1aa1c8	scx_layered: Fix typo in stats	2024-09-10 08:44:57 -07:00
likewhatevs	c4c3659b6d	Merge pull request #638 from likewhatevs/remove-rlimit-dep remove dependency on rlimit.rs	2024-09-10 03:14:12 -04:00
Andrea Righi	655ed5b4c6	scx_bpfland: use sum_exec_runtime to evaluate task's used time slice Using p->scx.slice to evaluate the consumed time slice can be a bit imprecise, because the sched_ext core implements yielding by setting p->scx.slice to 0. When the task's vruntime is evaluated this is considered as the task has exhausted its entire allocated time slice, even though it voluntarily released the CPU before the slice fully expired. To avoid this inaccuracy and prevent penalizing tasks that voluntarily release the CPU, always evaluate the used time slice based on the difference in the task's total execution time (p->se.sum_exec_runtime). This method provides a more precise calculation of vruntime and results in a fairer task's deadline evaluation. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-10 08:03:35 +02:00
patso	c1df85914b	remove dependency on rlimit.rs the rlimit crate is the only dependency crate with a build.rs. build.rs files complicate portability. this removes the need for rlimit.rs	2024-09-10 01:16:53 -04:00
Tejun Heo	56bb963136	build: Use a single top-level rust workspace Rust build was using two separate workspaces - rust/ and scheds/rust. There's no reason to separate them and it makes doc generation tricky. Use single top level workspace so that we can drive all rust building from cargo.	2024-09-08 14:23:48 -10:00
patso	120211d731	split build and test jobs split build and test jobs to reduce ci turnaround time and make it clear what is failing when something fails. also add virtiofsd to deps to make test compilation faster (most test time is compilation) and remove all force 9ps.	2024-09-08 02:54:24 -04:00
Changwoo Min	17e0e08e6e	Merge pull request #621 from multics69/lavd-greedy-fix scx_lavd: improve greedy ratio calculation and more	2024-09-07 10:52:00 +09:00
Tejun Heo	6f8917ceca	Merge pull request #624 from JakeHillion/cleanup-layer_growth_algo scx_layered: clean up Layer::new layer_growth_algo	2024-09-06 15:10:41 -10:00
Avraham Hollander	f71cc646a3	scx_bpfland: Fix in README.md for the same text as a comment in the source	2024-09-06 19:12:33 -04:00
Jake Hillion	2c008b2afa	scx_layered: clean up Layer::new layer_growth_algo	2024-09-06 18:25:50 +01:00
Changwoo Min	36df970a8f	scx_lavd: add debug print for turbo cores Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 19:23:17 +09:00
Changwoo Min	351a1c6656	scx_lavd: enable autopilot mode by default Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 19:23:12 +09:00
Andrea Righi	8231f8586a	scx_rlfifo: better documentation and code readability Simplify scx_rlfifo code, add detailed documentation of the scx_rustland_core API and get rid of the additional task queue, since it just makes the code bigger, slower and it doesn't really provide any benefit (considering that we are dispatching the tasks in FIFO order anyway). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-06 11:25:24 +02:00
Andrea Righi	ed879bae28	scx_rustland_core: expose enq_flags to user-space Pass the enqueue flags to the user-space scheduler through the QueuedTask struct. These flags allow the user-space scheduler to make more informed scheduling decisions. Also bump up scx_rustland_core minor version to reflect the new API (we are not breaking the old API, so we don't need to bump the major version in this case). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-06 11:25:24 +02:00
Changwoo Min	ebe9375b6a	scx_lavd: pretty printing of status Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 16:27:20 +09:00
Changwoo Min	461cb9a3a0	scx_lavd: fix calculation of greedy_ratio The service time (taskc->svc_time) should be the sum of total CPU time consumed, not just a delta. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 16:22:40 +09:00
Tejun Heo	46fc2e1a49	version: v1.0.4	2024-09-05 18:12:45 -10:00
Tejun Heo	cd555741d0	rust: Synchronize dependency versions	2024-09-05 17:10:02 -10:00
Changwoo Min	e3243c5d51	Merge pull request #612 from multics69/lavd-monitor scx_lavd: add --monitor flag and two micro-optimizations	2024-09-06 09:33:55 +09:00
Changwoo Min	d9274bd8e6	scx_lavd: drop time slice boost for big cores Unexpectedly, little cores, which have relatively short time slices, have more chance to schedule performance-critical tasks. Hence it is better to keep the time slice the same regardless of the core type. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 09:32:38 +09:00
Changwoo Min	fdecba227c	scx_lavd: print more info with --monitor Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-06 09:32:31 +09:00
Daniel Hodges	0fa369b914	Merge pull request #619 from hodgesds/stats-fixes scx_layered: Fix stats typo	2024-09-05 15:44:15 -04:00
Daniel Hodges	25e1642bbc	scx_layered: Fix stats typo Small typo fix Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-05 14:12:28 -04:00
Andrea Righi	918cfc613d	scx_bpfland: optimize producer/consumer workloads When selecting an idle CPU for a task that has been woken up, prioritize reusing the same CPU if the waker and wakee share the same L3 cache. Otherwise, attempt to migrate the wakee to the waker's CPU, provided it is allowed by the wakee's scheduling domain. This seems to consistently improve FPS performance when the system is not operating over its full capacity. Example: $ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600 - before: ~18305.77 FPS - after: ~19060.62 FPS Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-05 19:02:09 +02:00
Andrea Righi	28050dcd7d	Merge pull request #615 from sched-ext/bpfland-auto scx_bpfland: enable "auto" mode by default	2024-09-05 19:01:50 +02:00
Daniel Hodges	e6ed9b05ba	Merge pull request #614 from hodgesds/layered-stats-fix scx_layered: Fix stats formatting	2024-09-05 12:54:56 -04:00
Andrea Righi	844c00fd26	scx_bpfland: enable "auto" mode by default Rename "turbo domain" to "preferred domain", that conceptually is more generic and introduce the new option `--preferred-domain CPUMASK`, which allows users to define the preferred domain, specifying a cpumask as a hex number. By default ("auto") the scheduler will always try to detect and use the fastest CPUs in the system. Moreover, adjust the cpufreq logic to use "auto" both with the "balance_power" and "balance_performance" EPP profiles. Then, enable "auto" mode by default: the scheduler will try to automatically determine the optimal primary domain, preferred domain and cpufreq level, based on the selected scheduler and energy profiles. Tested-by: Piotr Gorski <piotr.gorski@cachyos.org> Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-05 16:11:12 +02:00
Daniel Hodges	76ad880475	scx_layered: Fix stats formatting Fix formatting precision of stats to have lower precision for readability. The existing formatting is hard to read: tot= 1538 local=31.27 open_idle= 2.73 affn_viol=23.80 proc=4ms busy= 1.1 util= 16.6 load= 32.7 fallback_cpu= 6 excl_coll=0.06501950585175553 excl_preempt=0.26007802340702213 excl_idle=0.16384915474642392 excl_wakeup=0.25097529258777634 With this fix stats are far more readable formatting: tot= 441 local=33.56 open_idle= 0.00 affn_viol=20.63 proc=3ms busy= 0.4 util= 6.3 load= 33.6 fallback_cpu= 6 excl_coll=0.454 excl_preempt=0.000 excl_idle=0.132 excl_wakeup=0.200 Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-05 06:44:54 -04:00
Changwoo Min	f490a55d54	scx_lavd: accumulate more system-wide statistics Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-05 16:03:14 +09:00
Changwoo Min	e5d27d0553	scx_lavd: print basic system status when --monitor is given Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-05 16:03:14 +09:00
Changwoo Min	6b717a3f3d	scx_lavd: add --help-stats option Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-05 16:03:14 +09:00
Changwoo Min	ca1c86eb9c	scx_lavd: improve pick_idle_cpu() for pinned tasks When a pinned task cannot run on either active or overflow sets, we try to stay on the previous CPU which is still okay to run on. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-05 16:03:14 +09:00
Andrea Righi	afc7b5404b	Merge pull request #600 from sched-ext/bpfland-cpufreq scx_bpfland: improve cpufreq awareness	2024-09-05 07:32:10 +02:00
Tejun Heo	f010eda5c0	meson: Remove scheds/rust/*/meson.build These aren't used since `43950c65` ("build: Use workspace to group rust sub-projects"). Drop them.	2024-09-04 06:40:17 -10:00
Andrea Righi	c3cab45f6a	scx_rustland_core: bump up version to 2.0.1 Bump up scx_rustland_core version to include this critical fix that allows to prevent scheduler stalls: `94a3594` ("scx_rustland_core: always dispatch per-cpu kthreads directly") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-04 08:00:25 +02:00
Andrea Righi	918f1db4bd	scx_bpfland: dynamically adjust cpufreq level in auto mode In auto mode, rather than keeping the previous fixed cpuperf factor, dynamically calculate it based on CPU utilization and apply it before a task runs within its allocated time slot. Interactive tasks consistently receive the maximum scaling factor to ensure optimal performance. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-03 21:36:48 +02:00
Daniel Hodges	9c5717577f	Merge pull request #601 from hodgesds/namespace-helpers scx_helpers: Add pid namespace helpers	2024-09-03 14:38:26 -04:00
Daniel Hodges	8f4e9e5e3b	scx_helpers: Add pid namespace helpers Add pid namespace helpers for translating namespace pids. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-09-03 11:21:32 -07:00
Andrea Righi	fe6ac15015	scx_bpfland: improve turbo domain CPU selection Always consider the turbo domain when running in "auto" mode. Additionally, when the turbo domain is used, split the CPU idle selection logic into two stages: 1) in ops.select_cpu(), provide the task with a second opportunity to remain within the same LLC 2) in ops.enqueue(), perform another check for an idle CPU, allowing the task to move to a different LLC if an idle CPU within the same LLC is not available. This allows tasks to stick more on turbo-boosted CPUs and CPUs within the same LLC. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-03 09:59:29 +02:00
Andrea Righi	70b93ed641	scx_bpfland: skip idle CPU selection for tasks with changing affinity When tasks are changing CPU affinity it is pointless to try to find an optimal idle CPU. In this case just skip the idle CPU selection step and let the task be dispatched to a global DSQ if needed. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-03 09:59:29 +02:00
Andrea Righi	802d104b46	scx_bpfland: add basic cpufreq support Add hints for the cpufreq governor based on the selected scheduler's performance profile and the current energy performance preference (EPP). With this change applied the scheduler works as follows: scheduler profile (--primary-domain option): - default: - use all cores - cpufreq: use default scaling factor - powersave: - use E-cores - cpufreq: use min scaling factor - performance: - use P-cores - cpufreq: use max scaling factor - auto: - EPP: power, powersave - use E-cores - cpufreq: use min scaling factor - EPP: balance_power (typically battery-powered systems) - use E-cores - cpufreq: use default scaling factor - EPP: balance_performance, performance - use P-cores - cpufreq: use max scaling factor Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-03 09:59:29 +02:00
Andrea Righi	d0fb29a0f7	scx_rustland: aggressively prioritize interactive tasks scx_rustland was originally designed as a PoC to showcase the benefits of implementing specialized schedulers via sched_ext, focusing on a very specific use case: prioritize game responsiveness regardless of what runs in the background. Its original design was subsequently modified to better serve as a general-purpose scheduler, balancing the prioritization of interactive tasks with CPU-intensive ones to prevent over-prioritization. With scx_bpfland serving as a more "general-purpose" scheduler, it makes sense to revisit scx_rustland's original goal and make it much more aggressive at prioritizing interactive tasks, determined in function of their average amount of context switches. This change makes scx_rustland again a really good PoC to showcase the benefits of having specialized schedulers, by focusing only at a very specific use case: provide a high and stable frames-per-second (fps) while a kernel build is running in the background. = Results = - Test: Run a WebGL application [1] while building the kernel (make -j32) - Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz +----------------------+--------+--------+ \| Scheduler \| avg fps\| stdev \| +----------------------+--------+--------+ \| EEVDF \| 28 \| 4.00 \| \| scx_rustland-before \| 43 \| 1.25 \| \| scx_rustland-after \| 60 \| 0.25 \| +----------------------+--------+--------+ [1] https://webglsamples.org/aquarium/aquarium.html Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-09-02 15:53:35 +02:00
Changwoo Min	172fe1efc6	Merge pull request #597 from multics69/lavd-turbo-tuning2 scx_lavd: misc updates (verifier, README, monitor option name, and micro-optimization)	2024-09-02 18:00:26 +09:00
Changwoo Min	0108b83050	scx_lavd: make the old verifier happy (bpf_cpumask_set_cpu) An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the timer handler, update_sys_stat(), to workaround the problem. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-02 18:00:12 +09:00
Changwoo Min	3bc2fd4977	scx_lavd: update README Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-02 18:00:12 +09:00
Changwoo Min	afbebaeed6	scx_lavd: check a core type of previous cpu at pick_idle_cpu() If a task is performance-critical, pick_idle_cpu() checks if the previous core is a big core or not. If not, don't try to run on previous core since a performance-critical task is better to run on a big core. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-01 17:28:16 +09:00
Changwoo Min	f2122c4197	Merge pull request #595 from multics69/lavd-turbo-tuning scx_lavd: improve the autopilot mode	2024-09-01 16:24:41 +09:00
Andrea Righi	1595445a63	Merge pull request #594 from sched-ext/scx-rustland-core-version-2 scx_rustland_core: bump up major version to 2.0.0	2024-09-01 08:57:32 +02:00
Changwoo Min	5ca4501139	scx_lavd: dynamically decide autopilot's low watermark A single threshold for a low watermark does not work well across systems with various numbers of cores and core types. Instead of using a single low watermark value, we dynamically decide the low watermark: 1) until one little core is fully utilized or 2) until two big cores are fully utilized. This works better across systems. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-09-01 12:46:57 +09:00
Andrea Righi	0aa71c832b	scx_rustland_core: bump up major version to 2.0.0 The scx_rustland_core API has been redesigned recently, breaking the compatibility with the past. Considering that Rust crates should update their major version when the previous API becomes incompatible [1], bump up the version to 2.0.0. [1] https://semver.org/ Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-31 23:23:26 +02:00
Andrea Righi	2cbf252019	scx_bpfland: directly dispatch only per-cpu kthreads with local_kthreads We want to directly dispatch only kthreads when local_kthreads is enabled, not all tasks that can run on a single CPU. Fixes: `7cc1846` ("scx_bpfland: always rely on prev_cpu with single-CPU tasks") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-31 16:35:54 +02:00
Changwoo Min	4a7b806dd2	scx_lavd: when no_freq_scaling, always set to the max freq When the no_freq_scaling changes during runtime in the autopilot mode, the last target freq set would not be 1024. So the performance mode enabled by the autopilot mode would not run in the best profile. Hence, we set the target freq to 1024 always when no_freq_scaling is set. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 18:22:33 +09:00
Daniel Hodges	63a2eecce8	Merge pull request #592 from hodgesds/layered-ts-fixes scx_layered: Fix layer timeslice not being applied	2024-08-30 15:34:57 -04:00
Daniel Hodges	e04b612688	scx_layered: Fix layer timeslice not being applied Fix a small bug where the layer timeslice is not applied. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-30 11:53:42 -07:00
Changwoo Min	4d8bf870a1	Merge pull request #591 from multics69/lavd-turbo3 scx_lavd: introduce "autopilot" mode and misc. optimization & bug fix	2024-08-31 02:14:35 +09:00
Andrea Righi	f782467eaf	scx_rustland: convert to scx_stats This allows scx_rustland to avoid generating excessive logs for statistics while still allowing detailed monitoring on demand. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-30 18:32:32 +02:00
Changwoo Min	9091dd983b	scx_lavd: add "--autopilot" mode Add "--autopilot" option and mode. In the autopilot mode, the scheduler dynamically changes its power mode according to system's load (cpu utilization). When the cpu utilization is low enough (say <=5%), it switches to the powersave mode since there is nothing to process fast so powersaving is the primary goal. When the utilization is moderate (say >5%, <=30%), it runs in balanced mode. When the utilization is high enough (say >30%), it runs in performance mode. Note that it only changes scheduler's power mode but it does not change system's energy profile. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Changwoo Min	5ecaa9ebe2	scx_lavd: improve the accuracy of cpu utilization calculation When a cpu is idle for a whole interval, its idle time does not correctly add up, so the utilization of such a cpu tends to be higher than the actual utilization. Now it is fixed, so cpu utilization becomes more accurate. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Changwoo Min	2f8cc0d60f	scx_lavd: rename the "--auto" option to "--autopower" to be clear Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Changwoo Min	815f1263b2	scx_lavd: reinitialize active cpumask when power mode changes When the power mode changes back to performance mode, we should reinitialize the active/overflow cpumasks to their initial state -- all big cores are in the active cpumask and all little cores are in the overflow cpumask. Otherwise, the previous mode's active/overflow cpumasks will be used in the performance mode. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Changwoo Min	afb8c78a09	scx_lavd: print power mode change in the auto mode Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Changwoo Min	a89a56dba4	scx_lavd: add a fastpath in ops.select_cpu() for a sharply pinned task If a task can be run only on a single cpu, we don't need to go through all the steps in ops.select_cpu(). Instead, we simply check if a task is still pinned on the prev_cpu and go. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-31 01:14:33 +09:00
Andrea Righi	b54fc202b8	Merge pull request #583 from sched-ext/bpfland-fix-pcpu-direct-dispatch scx_bpfland: always rely on prev_cpu with single-CPU tasks	2024-08-30 18:12:59 +02:00
Andrea Righi	7cc18460b9	scx_bpfland: always rely on prev_cpu with single-CPU tasks When selecting an idle CPU for tasks that can only run on a single CPU, always check if the previously used CPU is still usable, instead of trying to figure out the single allowed CPU looking at the task's cpumask. Apparently, single-CPU tasks can report a prev_cpu that is not in the allowed cpumask when they rapidly change affinity. This could lead to stalls, because we may end up dispatching the kthread to a per-CPU DSQ that is not compatible with its allowed cpumask. Example: kworker/u32:2[173797] triggered exit kind 1026: runnable task stall (kworker/2:1[70] failed to run for 7.552s) ... R kworker/2:1[70] -7552ms scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369 cpus=04 In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall. To prevent this, do not try to figure out the best idle CPU for tasks that are changing affinity and just dispatch them to a global DSQ (either priority or regular, depending on its interactive state). Moreover, introduce an explicit error check in dispatch_direct_cpu() to improve detection of similar issues in the future, and drop lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now safely handle all the cases where the task context is not found. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-30 09:45:58 +02:00
Changwoo Min	3e2e78a9ec	Merge pull request #584 from multics69/lavd-turbo2 scx_lavd: automatically determine power mode and more	2024-08-30 08:56:16 +09:00
Daniel Hodges	47184e9d19	Merge pull request #582 from hodgesds/layered-growth-interface scx_layered: Add layer growth config	2024-08-29 18:49:59 -04:00
Changwoo Min	bb08919203	scx_lavd: determine power mode automatically with --auto option It checks the EPP (energy performance preference) periodically and sets the power profile of the scheduler during runtime as a user changes its EPP profile (from her desktop UI). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-29 19:15:23 +09:00
Andrea Righi	cc3f696c4b	Merge pull request #577 from sched-ext/bpfland-task-affinity scx_bpfland: enhanced task affinity	2024-08-29 07:46:57 +02:00
Daniel Hodges	7e0329e45c	scx_layered: Add layer growth config Add a per layer config for different implementations of layer growth algorithms. Convert the existing default logic into a default layer growth algorithm and add a linear implementation. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 19:17:24 -07:00
Daniel Hodges	cf765562c7	scx_layered: Update docs for layer slice setting Add docs for layer slice setting. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 22:12:07 -04:00
Daniel Hodges	a23308e7b0	scx_layered: Add more docs on tuning Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 12:38:05 -07:00
Daniel Hodges	96326b1ef3	scx_layered: Add additional docs Add some additional docs on tuning layered. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 12:27:26 -07:00
Daniel Hodges	cc450f1a4b	scx_layered: Add per layer timeslice Allow setting a different timeslice per layer. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 11:21:03 -07:00
Daniel Hodges	c511b42b7b	scx_layered: Make verification easier on older kernels Refactor some BPF code to make verification easier on older kernels. This is to make it easier to maintain backports. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 08:05:10 -07:00
Daniel Hodges	12f8cb74b5	scx_utils: Add GPU topology Add GPU awareness to the topology crate. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 06:35:35 -07:00
Andrea Righi	28cb1ec5cb	scx_bpfland: enhanced task affinity Aggressively try to keep tasks running on the same CPU / cache / domain, to achieve higher performance when the system is not overcommitted. This is done by giving a second chance in ops.enqueue(), in addition to ops.select_cpu(), to find an idle CPU close to the previously used CPU. Moreover, even if the task is dispatched to the global DSQs, always try to check if there is an idle CPU in the primary domain that can immediately consume the task.
= Results =
This change seems to provide a minor, but consistent, boost of performance with the CPU-intensive benchmarks from the CachyOS benchmarks selection [1]. Similar results can also be noticed with some WebGL benchmarks [2], when system usage is close to its maximum capacity.
Test:
 - cachyos-benchmarker
System:
 - AMD Ryzen 7 5800X 8-Core Processor
Metrics:
 - total time: elapsed time of all benchmarks
 - total score: geometric mean of all benchmarks
NOTE: total time is the most relevant, since it gives a measure of the aggregate performance, while the total score emphasizes more on performance consistency across all benchmarks.
== Results: summary ==
+-------------------------+---------------------+---------------------+
| Scheduler               | Total Time          | Total Score         |
|                         | (less = better)     | (less = better)     |
+-------------------------+---------------------+---------------------+
| EEVDF                   | 624.44 sec          | 123.68              |
| bpfland                 | 625.34 sec          | 122.21              |
| bpfland-task-affinity   | 623.67 sec          | 122.27              |
+-------------------------+---------------------+---------------------+
== Conclusion ==
With this patch applied, bpfland shows both a better performance and consistency. Although the gains are small (less than 1%), they are still significant for this type of benchmark and consistently appear across multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker [2] https://webglsamples.org/aquarium/aquarium.html Tested-by: Piotr Gorski < piotr.gorski@cachyos.org > Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 10:30:54 +02:00
Avraham Hollander	6c5d85401d	Merge branch 'sched-ext:main' into main	2024-08-27 23:07:54 -04:00
Avraham Hollander	2a3cbeb760	scx_lavd: Add same power mode clarification to --no-prefer-turbo-core	2024-08-27 23:06:31 -04:00
Changwoo Min	5588126cff	scx_lavd: minor optimization for consume_task() When iterating neighbors, the existing code unnecessarily iterates all the neighbors up to the maximum even if there are no neighbors. The fix exits early when there are no neighbors. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:26:50 +09:00
Changwoo Min	95272ae910	scx_lavd: proper handling of ctrl-c in a monitoring mode Ctrl-c wasn't properly handled in the monitoring mode (`--monitor-sched-samples`), so the scheduler could not be terminated by pressing ctrl-c. The missing ctrl-c handling is added to the monitor thread. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:05:34 +09:00
Changwoo Min	9c4428fd8b	scx_lavd: remove unused rust functions Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:02:11 +09:00
Andrea Righi	a155d5185d	scx_bpfland: rely on Topology to classify core types Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs. Moreover, support the special keyword "all" with --primary-domain to include all the CPUs in the system (default). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:23:55 +02:00
Andrea Righi	872e653cd2	scx_utils: introduce Turbo core type to Topology Integrate the logic used by scx_bpfland to detect turbo-boosted cores in Topology. Also change the logic to detect Big/Little cores based on base_frequency, instead of scaling_max_freq, otherwise turbo-boosted cores in homogeneous systems may be incorrectly classified as Big. Moreover, introduce the following new methods to Cpu to check for the core type: - is_turbo(): return true if the CPU is Turbo, false otherwise - is_big(): return true if the CPU is either Turbo or Big - is_little(): return true if the CPU is Little Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:09:08 +02:00
Daniel Hodges	41cebb807a	Merge pull request #569 from anh0516/main scx_layered: Clean up in-code documentation; add commas for consistency	2024-08-27 09:47:29 -04:00
Andrea Righi	6768f9f88c	Merge pull request #572 from sched-ext/bpfland-fix-turbo-domain scx_bpfland: fix turbo boost domain nullifying primary domain limits	2024-08-27 15:23:12 +02:00
Andrea Righi	e0f49a338a	scx_bpfland: fix turbo boost domain nullifying primary domain limits When creating the turbo boost scheduling domain, we might use a full CPU mask (selecting all possible CPUs) to indicate "do not prioritize turbo boost CPUs" or when all CPUs have the same maximum frequency. This approach works when the primary domain also contains all the CPUs, as the complete overlap allows the CPU selection logic to ignore the turbo boost domain and start picking CPUs directly from the primary domain. However, if the primary domain doesn't include all CPUs, the two domains won't fully overlap, which can lead to the turbo boost domain incorrectly including all CPUs, thereby negating the restrictions set by the primary scheduling domain. To resolve this, an empty CPU mask should be used for the turbo boost domain when turbo boost CPUs aren't prioritized. If the turbo boost domain is empty, it should be entirely bypassed, and the selection should proceed directly to the primary domain. Reported-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-27 13:36:50 +02:00
Changwoo Min	00430c3ded	scx_lavd: make a loop easier to correctly verify With an unlucky combination of old kernel and old LLVM, the BPF verifier incorrectly detects an infinite loop. After changing the loop to use a constant end, the old verifier can pass the code. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-27 17:11:20 +09:00
Changwoo Min	09cff560aa	Merge pull request #566 from multics69/lavd-turbo scx_lavd: prioritize the turbo boost-able cores	2024-08-27 08:47:25 +09:00
Daniel Hodges	83cd26eb9e	Merge pull request #564 from hodgesds/layered-help scx_layered: Update help for tgid matching	2024-08-26 14:52:53 -04:00
Andrea Righi	35db89e90d	Merge pull request #568 from sched-ext/rustland-core-design-improv scx_rustland_core: small core design improvements	2024-08-26 20:06:21 +02:00
Avraham Hollander	7a43801d76	Add quotes for clarity	2024-08-26 13:20:01 -04:00
Avraham Hollander	0b6ebf826e	scx_lavd, scx_mitosis, scx_rusty: Add comma for grammatical consistency with the same change in the other schedulers	2024-08-26 13:06:58 -04:00
Avraham Hollander	07039f1f07	scx_layered: Documentation cleanup	2024-08-26 13:03:52 -04:00
Andrea Righi	1427d7d347	scx_rlfifo: enhance code design Refactor the code design to make it more suitable as a template for implementing advanced scheduling policies. In particular, create separate loops for task consumption and task dispatching. This will make the scheduler easier to adapt as a foundation for implementing more complex scheduling policies. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-26 16:10:54 +02:00
Daniel Hodges	c45c2de39f	scx_layered: Update help for tgid matching Forgot to add doc for tgid matching Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-26 07:06:21 -07:00
Changwoo Min	9807e561f0	scx_lavd: prioritize the turbo boost-able cores Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	cd5b2bf664	scx_lavd: replace nix signal handler to ctrlc Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	e887c56da0	scx_lavd: add "--version" option, which prints the current version Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	0f97ca3066	scx_lavd: drop time slice calculation in ops.select_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:55:00 +09:00
Changwoo Min	4e3c36ca3f	scx_lavd: handle the missing cases in time slice calculation Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	be7d06e280	scx_lavd: make the old BPF verifier happy :-( Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	82f55b95b2	scx_lavd: add a fast path in pick_idle_cpu() when SMT is not activated Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	38779dbe8b	scx_lavd: improve pick_idle_cpu() Now it checks an active cpumask within a previous core's compute domain before checking the full active CPUs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	d1d9e97d08	scx_lavd: reduce LAVD_CPDOM_MAX_DIST to 4 The BPF verifier in the old kernel gives up analyzing the nested loop in consume_task(). We make the loop less complex by reducing LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy. Note that the theoretical maximum distance is 6 (numa > llc > core type) but there is no such hardware today, hence reducing it to 4 should be okay for the next few years, when hopefully the verifier becomes smarter. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	950710990f	scx_lavd: move time slice calculation to ops.enqueue() and ops.select_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	954b684a70	scx_lavd: update nr_queued_task every system stat update interval Updating nr_queued_task on every runqueue operation is expensive and unnecessary. So we update it every system stat update interval and use a moving average, which is accurate enough. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	4f906f1f49	scx_lavd: update README since it supports multi-CCX/NUMA Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	9551657b42	scx_lavd: prefer big cores in the performance mode Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	d4bb35e651	scx_lavd: use itertools::iproduct!() for a nested loop Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	9368c6881d	scx_lavd: replace get_task_cpu_id() to scx_bpf_task_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Andrea Righi	a469f0f1ce	Merge pull request #561 from sched-ext/bpfland-fix-energy-profile-refresh scx_bpfland: prevent reading energy profile if not available	2024-08-25 18:31:34 +02:00
Tejun Heo	ca13e13ad6	Merge pull request #559 from sched-ext/htejun/cargo-workspace build: Use workspace to group rust sub-projects	2024-08-25 06:26:18 -10:00
Andrea Righi	f8acd069f0	scx_bpfland: prevent reading energy profile if not available Avoid periodically reading the current performance profile from /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if it's not available (i.e., with older CPUs or kernels without cpufreq). This fixes issue #560. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 16:53:35 +02:00
Andrea Righi	8853d9a9f2	Merge pull request #548 from sched-ext/rustland-core-refactoring scx_rustland_core: user-space framework refactoring	2024-08-25 16:39:28 +02:00
Tejun Heo	43950c65bd	build: Use workspace to group rust sub-projects meson build script was building each rust sub-project under rust/ and scheds/rust/ separately. This means that each rust project is built independently which leads to a couple problems -
1. There are a lot of shared dependencies but they have to be built over and over again for each project.
2. Concurrency management becomes sad - we either have to unleash multiple cargo builds at the same time possibly thrashing the system or build one by one.
We've been trying to solve this from meson side in vain. Thankfully, in issue #546, @vimproved suggested using cargo workspace which makes the sub-projects share the same target directory and built together by the same cargo instance while still allowing each project to behave independently for development and publishing purposes. Make the following changes:
- Create two cargo workspaces - one under rust/, the other under scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust schedulers. No need to build them from meson. Cargo will build them as needed.
- Change the rust_scheds build target to invoke `cargo build` in scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.
- This changes rust binary directory. Update README and meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all schedulers now.
- Unify .gitignore handling.
The following are build times on Ryzen 3975W:
Before:
________________________________________________________
Executed in  165.93 secs    fish           external
   usr time   40.55 mins    2.71 millis   40.55 mins
   sys time    3.34 mins   36.40 millis    3.34 mins
After:
________________________________________________________
Executed in   36.04 secs    fish           external
   usr time  336.42 secs    0.00 millis  336.42 secs
   sys time   36.65 secs   43.95 millis   36.61 secs
Wallclock time is reduced 5x and CPU time 7x.	2024-08-25 00:47:58 -10:00
Andrea Righi	894f9582d0	scx_rustland_core: hide shutdown boilerplate in BpfScheduler Refactor the code to hide the shutdown handling inside BpfScheduler and simply use the exited() method to check when the scheduler is stopped. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 12:17:04 +02:00
Tejun Heo	152a8471cc	scx_bpfland: When reporting stats, use interval deltas Three of the reported stats are cumulative. While they obviously can be processed into delta values, that holds for the other direction too and the cumulative values are difficult to make intuitive sense of. Report interval delta values instead. Note that a stats client can reliably build back cumulative values even under heavy system contention - the delta values reported between two consecutive reads are guaranteed to be correct regardless of the duration of the interval.	2024-08-24 23:14:57 -10:00
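The relationship between cumulative counters and interval deltas described in the commit above can be sketched as follows. This is an illustrative Rust sketch, not the scx_stats implementation; `DeltaReporter` and `rebuild_cumulative` are hypothetical names.

```rust
/// Hypothetical reporter that turns a monotonically increasing counter
/// into per-interval deltas.
struct DeltaReporter {
    prev: u64,
}

impl DeltaReporter {
    fn new() -> Self {
        DeltaReporter { prev: 0 }
    }

    /// Returns the increase since the previous read and remembers the new total.
    fn read(&mut self, cumulative: u64) -> u64 {
        let delta = cumulative.saturating_sub(self.prev);
        self.prev = cumulative;
        delta
    }
}

/// Client side: summing every delta read so far recovers the cumulative value.
fn rebuild_cumulative(deltas: &[u64]) -> u64 {
    deltas.iter().sum()
}
```

Because each delta covers exactly the span between two consecutive reads, the sum stays correct even if the reads are unevenly spaced.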
Tejun Heo	bd68e230b9	scx_bpfland: Convert to scx_stats Use scx_stats instead of prometheus for stats reporting. This has a few advantages: - Stats metadata can be defined more succinctly. - Natural support for nesting statistics which will be useful in making scheduler components composable. - Support for multiple programmable readers where each reader can use their own reading interval. - Built-in stats help message generation. - Openmetrics integration is still available through scx_stats/scripts/scxstats_to_openmetrics.py.	2024-08-24 23:14:55 -10:00
Tejun Heo	625381280c	scx_stats: Shorten exported names and add prelude module Let's make it a bit easier to use: - Shorten exported names by changing the prefix from ScxStats to Stats. This should be distinctive enough and more inline with how most libraries name their exports. - Importing the right set of traits can be tricky. Introduce prelude module so that importing is a bit less painful.	2024-08-24 22:04:25 -10:00
Andrea Righi	a2e97fecbb	scx_rustland_core: merge verbose and debug in the same option There is no reason to have two separate options for "verbose" and "debug" mode. Just merge the two and always use "debug". If enabled, increase verbosity to stdout and enable reporting BPF scheduling events in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:45:20 +02:00
Andrea Righi	cb16a11342	scx_rustland_core: get rid of the global scheduler's slice_us Since scx_rustland_core enables setting a time slice on a per-task basis during task dispatch, there's no need to maintain a global time slice in the BPF component. Instead, a global time slice can simply be managed in user-space, achieving the same outcome. Therefore, drop the global slice_us property from BpfScheduler to simplify the API. NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be used by default. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:45:18 +02:00
Andrea Righi	e404bee5e7	scx_rustland / scx_rlfifo: small code format fixes Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:44:52 +02:00
Andrea Righi	1cd11ba916	scx_rlfifo: improve documentation and code readability Add more comments to make the source code more understandable, so that it can be easily used as a template for implementing more complex scheduling policies. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:44:28 +02:00
Tejun Heo	35a4326aee	scx_lavd: Drop unnecessary stat field explanation on startup The scheduling instance no longer prints out sched samples. No reason to print field explanation on startup.	2024-08-24 18:48:54 -10:00
Changwoo Min	02ad793c78	Merge branch 'main' into htejun/scx_lavd-stats	2024-08-25 11:57:41 +09:00
Changwoo Min	8b1874c27f	Merge pull request #552 from CachyOS/lavd-mutli-cxx2 scx_lavd: Drop message about unsupported multi-CXX support	2024-08-25 11:48:12 +09:00
Tejun Heo	fdfb7f60f4	Merge branch 'main' into htejun/scx_lavd-stats	2024-08-24 15:53:53 -10:00
Tejun Heo	55e5b8b43f	scx_lavd: Switch to scx_stats Scheduling sample reporting is switched to use scx_stats. This makes the scheduler run without making too much noise while still allowing monitoring on demand. It can also make introspection more dynamic - e.g. it shouldn't be difficult to add other monitoring commands which take scheduling samples based on different criteria or add other types of statistics. --nr_sched-samples is replaced with --monitor-nr-samples.	2024-08-24 15:53:02 -10:00
Tejun Heo	1bba713a29	Merge pull request #542 from sched-ext/htejun/scx_stats scx_stats, scx_rusty, scx_layered: Implement `--help-stats`	2024-08-24 15:38:36 -10:00
Peter Jung	906d054770	scx_lavd: Drop message about unsupported multi-CXX support Signed-off-by: Peter Jung <admin@ptr1337.dev>	2024-08-25 01:10:38 +02:00
Andrea Righi	0aa23481de	scx_rustland_core: drop update_tasks() and introduce notify_complete() The update_tasks() API is somewhat confusing, so replace it with a clearer API, notify_complete(). This new API will return control to the BPF component and inform it about the number of tasks still pending in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 00:45:23 +02:00
Daniel Hodges	e81faef103	Merge pull request #544 from hodgesds/layered-tgid scx_layered: Add layer match for tgid	2024-08-24 16:58:19 -04:00
Andrea Righi	5ece102554	scx_rustland: get rid of unnecessary debugging information Additional statistics will be re-added later via scx_stats. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	cef8ff8757	scx_rustland_core: get rid of the low_power API The low-power API is a bit of a hack implemented purely in the BPF layer, this should be better re-implemented with some concepts of topology awareness. Therefore, get rid of this API for now. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	be7ef1009b	scx_rlfifo: user-space idle CPU selection Select an idle CPU from user-space, instead of always dispatching on the first CPU available. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	568e292a24	scx_rustland_core: get rid of the exiting task API The current API used to notify the user-space scheduler when a task exits is really confusing (setting a negative value in queued_task_ctx.cpu), and it's also possible to detect task exiting events from user-space (or check in procfs, even if it's slower). In any case, a better API should be provided for this, so drop the current one for now. NOTE: this will cause additional memory usage for scx_rustland, but it can be fixed/addressed later in a separate commit (i.e., providing a periodic garbage collector for the unused task entries). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	5d544ea264	scx_rustland_core: move CPU idle selection logic in user-space Allow user-space scheduler to pick an idle CPU via self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's select_cpu() interface. Also remove the full_user option and always rely on the idle selection logic from user-space. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:28:13 +02:00
Andrea Righi	1dd329dd7d	scx_rustland: update Cargo.lock Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 20:24:48 +02:00
Andrea Righi	106d59d997	scx_rlfifo: update Cargo.lock Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 20:24:48 +02:00
Andrea Righi	016aae759f	Merge pull request #545 from sched-ext/bpfland-honor-avg-nvcsw scx_bpfland: always honor average nvcsw in lowlatency mode	2024-08-24 20:24:33 +02:00
Avraham Hollander	66b5dd0de9	Clean up scx_rusty help info a bit	2024-08-24 11:56:12 -04:00
Avraham Hollander	c34a470024	scx_lavd: Fix my own formatting error	2024-08-24 11:36:19 -04:00
Andrea Righi	5a08855a86	scx_bpfland: always honor average nvcsw in lowlatency mode Keep evaluating the average number of voluntary context switches for each task when lowlatency mode is enabled, even when interactive tasks classification is disabled (via `-c 0`). The average nvcsw is also used in lowlatency mode to evaluate the proportional bonus to the tasks' deadline and it shouldn't be ignored when interactive tasks classification is disabled. Moreover, make sure that such bonus never exceeds the starvation threshold. Keep in mind that it is still possible to disable the periodic average nvcsw evaluation with `-c 0`, without specifying `--lowlatency`. Fixes: `6a22853` ("scx_bpfland: introduce --lowlatency option") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 10:42:22 +02:00
Tejun Heo	48092c6f88	scx_lavd: Relay introspection output in stats::TaskSample This indirection doesn't make any visible behavior difference now but will be used to implement scx_stats support.	2024-08-23 18:49:36 -10:00
Tejun Heo	725fa7f1be	Merge branch 'main' into htejun/scx_stats	2024-08-23 17:10:08 -10:00
Daniel Hodges	5a2012763e	scx_layered: Add layer match for tgid Add layer match for tgid. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-23 23:00:28 -04:00
Avraham Hollander	bedb18b48e	Improve scx_lavd help info A lot of scx_lavd's options do not clearly explain what they do. Add some short explanations, clean up the existing ones, and direct the user to read the in-code documentation for more info.	2024-08-23 18:56:14 -04:00
Avraham Hollander	d6e27b59e7	Clean up scx_bpfland help info a bit	2024-08-23 18:55:04 -04:00
Tejun Heo	25e437753c	scx_layered, scx_rusty: Implement --help-stats which shows all the defined stats. While at it, make some cosmetic updates.	2024-08-23 12:39:47 -10:00
Tejun Heo	405bcc63fe	scx_stats: Make ScxStatsServerData a public carrier of data needed for stats server And move related ops into it. This is a bit more natural and will also allow doing other operations (e.g. describing stats) without launching the server.	2024-08-23 12:23:57 -10:00
Tejun Heo	7bd35b6cd3	scx_lavd: Cargo.lock update (caused by scx_utils depending on scx_stats)	2024-08-23 09:21:44 -10:00
Andrea Righi	e72676ede3	Merge pull request #540 from sched-ext/bpfland-cpufreq-awareness scx_bpfland: cpu frequency and energy awareness	2024-08-23 21:17:34 +02:00
Tejun Heo	9e3b4e6db0	scx_stats: A bit of cleanups and renames	2024-08-23 09:09:02 -10:00
Tejun Heo	b6ccb87bec	Merge pull request #539 from sched-ext/htejun/scx_rusty scx_rusty: Convert to scx_stats	2024-08-23 08:42:47 -10:00
Daniel Hodges	7d45059fa9	Merge pull request #538 from hodgesds/layered-pid scx_layered: Add pid/ppid matches	2024-08-23 14:08:40 -04:00
Tejun Heo	8c8912ccea	Merge branch 'main' into htejun/scx_rusty	2024-08-23 07:50:23 -10:00
Andrea Righi	50684e4569	scx_bpfland: introduce Intel Turbo Boost awareness Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize them over the primary scheduling domain when the energy model `balance_power` is used (typically when running on battery power with the "balanced" profile). With this change the scheduling hierarchy becomes the following:
1) CPUs in the turbo scheduling domain
2) CPUs in the primary scheduling domain
3) full-idle SMT CPUs
4) CPUs in the same L2 cache
5) CPUs in the same L3 cache
6) CPUs in the task's allowed domain
And the idle selection logic is modified as follows:
- In the turbo scheduling domain:
  - pick same full-idle SMT CPU
  - pick any other full-idle SMT CPU sharing the same L2 cache
  - pick any other full-idle SMT CPU sharing the same L3 cache
  - pick any other full-idle SMT CPU
  - pick same idle CPU
  - pick any other idle CPU sharing the same L2 cache
  - pick any other idle CPU sharing the same L3 cache
  - pick any other idle SMT CPU
- In the primary scheduling domain:
  - pick same full-idle SMT CPU
  - pick any other full-idle SMT CPU sharing the same L2 cache
  - pick any other full-idle SMT CPU sharing the same L3 cache
  - pick any other full-idle SMT CPU
  - pick same idle CPU
  - pick any other idle CPU sharing the same L2 cache
  - pick any other idle CPU sharing the same L3 cache
  - pick any other idle SMT CPU
- In the entire task domain:
  - pick any other idle CPU
Keep in mind that the turbo domain will be evaluated only when the scheduler is started with `--primary-domain auto` and only when the `balance_power` energy profile is used. The turbo domain is always made using the subset of CPUs in the system with the highest max frequency. If such subset can't be determined (for example if all the CPUs in the primary domain have the same frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help to improve performance by forcing the governor to scale up their frequency, without increasing power consumption too much, due to the fact that tasks will be preferably confined to a reduced number of cores. This change seems to improve performance, without increasing power consumption much, on Intel laptops while using the `balance_power` energy profile. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:08 +02:00
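The domain ordering described in the commit above (turbo domain first, then primary domain, then the whole task domain) can be sketched with plain bitmasks. This is a simplified Rust illustration that leaves out the SMT and L2/L3 cache steps; `pick_idle_cpu` and its parameters are hypothetical and do not mirror the BPF code.

```rust
/// Bit N set means CPU N belongs to the mask. All names are hypothetical.
/// Returns the lowest-numbered idle CPU from the highest-priority domain
/// that has one, or None if no allowed CPU is idle.
fn pick_idle_cpu(idle: u64, allowed: u64, turbo: u64, primary: u64) -> Option<u32> {
    let usable = idle & allowed;
    // Domains in priority order; the last entry stands for the whole task
    // domain. An empty turbo mask never matches, so it is skipped naturally.
    [turbo, primary, u64::MAX]
        .iter()
        .map(|domain| usable & domain)
        .find(|candidates| *candidates != 0)
        .map(|candidates| candidates.trailing_zeros())
}
```

For example, with idle CPUs 2 and 3, a turbo domain of CPU 3 and a primary domain of CPUs 1 and 2, the sketch picks CPU 3; with an empty turbo mask it falls through to CPU 2.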
Andrea Righi	d958dd4482	scx_bpfland: introduce dynamic energy profile Introduce the new option `--primary-domain auto`. With this option the scheduler will dynamically adjust the primary scheduling domain at run-time, based on the current energy profile reported in /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference. When the `power` energy profile is selected, the primary scheduling domain will prioritize E-cores. Alternatively, when the `performance` profile is selected, it will prioritize P-cores. For all the other energy profiles, all the CPUs in the system will be used. Note that this option is only relevant on hybrid architectures with P-cores and E-cores. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:01 +02:00
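The profile-to-domain mapping described in the commit above can be sketched in a few lines of Rust. This is a hedged illustration, not the scheduler's code: `PrimaryDomain`, `read_energy_profile`, and `domain_for_profile` are hypothetical names, and only the three cases stated in the commit message are modeled.

```rust
/// Hypothetical result of the mapping: which CPUs the primary domain prefers.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum PrimaryDomain {
    LittleCores, // E-cores
    BigCores,    // P-cores
    AllCpus,
}

/// Reads the profile string from a sysfs-style file; returns None when the
/// file is missing or unreadable (e.g. no cpufreq support).
fn read_energy_profile(path: &str) -> Option<String> {
    std::fs::read_to_string(path)
        .ok()
        .map(|s| s.trim().to_string())
}

/// `power` prefers E-cores, `performance` prefers P-cores, and any other
/// profile uses all CPUs.
fn domain_for_profile(profile: &str) -> PrimaryDomain {
    match profile.trim() {
        "power" => PrimaryDomain::LittleCores,
        "performance" => PrimaryDomain::BigCores,
        _ => PrimaryDomain::AllCpus,
    }
}
```

A caller would pass the path quoted in the commit message to `read_energy_profile` and skip the refresh entirely when it returns `None`.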
Tejun Heo	44a0f1b124	scx_utils: Factor out monitor_stats() from scx_rusty and scx_layered	2024-08-23 06:46:19 -10:00
Tejun Heo	ae3024e938	scx_layered: Add --stats and make --monitor behavior consistent with scx_rusty	2024-08-23 05:52:52 -10:00
Tejun Heo	0f04a93dd1	scx_rusty: Add stat descriptions and make minor adjustments	2024-08-23 05:46:13 -10:00
Tejun Heo	36865234f8	scx_rusty: Add scx_stats annotations necessary for openmetrics translation	2024-08-23 04:59:08 -10:00
Tejun Heo	2f3f473cd3	scx_rusty: Improve timestamp reporting	2024-08-23 04:31:27 -10:00
Daniel Hodges	11b978a892	scx_layered: Add pid/ppid matches Add matches for pid/ppid. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-23 07:20:05 -07:00
Tejun Heo	76934f3aab	scx_rusty: Convert to scx_stats This allows scx_rusty to avoid generating excessive logs for statistics while still allowing detailed monitoring on demand.	2024-08-22 19:44:12 -10:00
Tejun Heo	16c07a5cd9	scx_rusty: Don't reset bpf_stats, remember prev states and calculate delta This will ease transition to scx_stats.	2024-08-22 13:02:23 -10:00
Tejun Heo	13fa48a871	scx_rusty: Separate out stats generation and formatting to prepare for scx_stats conversion.	2024-08-22 10:03:10 -10:00
Tejun Heo	b4564520e5	scx_rusty: Simplify Stats structs and take id out of the structs to prepare for scx_stats conversion. While at it, make some cosmetic changes.	2024-08-22 08:45:33 -10:00
Andrea Righi	6a2285398d	scx_bpfland: introduce --lowlatency option Introduce the new `--lowlatency` option, which enables switching between the default pure vruntime-based scheduling (more optimized for server workloads) and a deadline-based scheduling (better suited for low-latency workloads). When the low-latency mode is activated, a task's deadline is calculated as its vruntime, adjusted by a bonus proportional to the task's average number of voluntary context switches (the more voluntary context switches, the shorter the deadline). This feature enhances the prioritization of interactive tasks even more, proportionally to their average voluntary context switches, also within the two main global queues (priority / shared) and it helps to keep interactive workloads always responsive, even in the presence of heavy non-interactive background work. Low-latency mode helps prevent audio crackling even in the presence of a large amount of short-lived tasks with pseudo-interactive behavior (e.g., hackbench) and it enables achieving approximately a +33% average frames-per-second (FPS) in the typical "gaming while building the kernel" benchmark. However, it can also amplify the de-prioritization of CPU-intensive tasks, making this option more suitable for specific low-latency scenarios. Therefore the low-latency mode is disabled by default and it can only be enabled via the `--lowlatency` option. Tested-by: Piotr Gorski (piotrgorski@cachyos.org) Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 13:26:19 +02:00
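The deadline rule described in the commit above (vruntime reduced by a bonus proportional to the average number of voluntary context switches) might be sketched as below. The exact scaling is not given in the commit message, so `bonus_unit_ns` and the cap `max_bonus_ns` are assumptions introduced purely for illustration, as is the function name.

```rust
/// Hypothetical deadline computation: the more voluntary context switches a
/// task averages, the larger the bonus and the earlier its deadline.
/// `bonus_unit_ns` (bonus per context switch) and `max_bonus_ns` (upper bound
/// on the bonus) are assumed parameters, not values from the scheduler.
fn task_deadline(vruntime: u64, avg_nvcsw: u64, bonus_unit_ns: u64, max_bonus_ns: u64) -> u64 {
    let bonus = avg_nvcsw.saturating_mul(bonus_unit_ns).min(max_bonus_ns);
    // Saturate at zero so a large bonus cannot wrap the deadline around.
    vruntime.saturating_sub(bonus)
}
```

A task with no voluntary context switches keeps a deadline equal to its vruntime, while a highly interactive task is pulled earlier by at most `max_bonus_ns`.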
Tejun Heo	4834dec684	scx_rusty: Move stats structs to stats.rs and rename for consistency	2024-08-21 22:04:38 -10:00
Andrea Righi	b0a8e4a91e	scx_bpfland: better time slice control Explicitly replenish the task's time slice from ops.dispatch() if the task still wants to run and no other task is selected. In this way the sched_ext core won't automatically re-schedule the task on the same CPU, implicitly assigning a time slice of SCX_SLICE_DFL. Moreover, instead of determining the task time slice in ops.enqueue(), refresh the time slice immediately before the task is started on its assigned CPU in ops.running(). This allows using a more precise time slice, adjusted based on the actual number of tasks that are currently waiting to be scheduled. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 09:23:37 +02:00
Tejun Heo	d6ac5fbd9c	scx_layered: Drop SCX_OPS_ENQ_LAST The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and enqueueing on local DSQ will no longer be sufficient to avoid stalls. No reason to do it anyway. Just drop it.	2024-08-21 13:13:59 -10:00
Tejun Heo	f726f0b73b	Version: Cargo.lock	2024-08-21 06:45:19 -10:00
Tejun Heo	4d1f0639d8	Version: v1.0.3	2024-08-21 06:42:11 -10:00

1129 Commits