JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-12-03 06:17:11 +00:00

Author	SHA1	Message	Date
Daniel Hodges	7e0329e45c	scx_layered: Add layer growth config Add a per layer config for different implementations of layer growth algorithms. Convert the existing default logic into a default layer growth algorithm and add a linear implementation. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 19:17:24 -07:00
Daniel Hodges	cf765562c7	scx_layered: Update docs for layer slice setting Add docs for layer slice setting. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 22:12:07 -04:00
Daniel Hodges	a23308e7b0	scx_layered: Add more docs on tuning Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 12:38:05 -07:00
Daniel Hodges	96326b1ef3	scx_layered: Add additional docs Add some additional docs on tuning layered. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 12:27:26 -07:00
Daniel Hodges	cc450f1a4b	scx_layered: Add per layer timeslice Allow setting a different timeslice per layer. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 11:21:03 -07:00
Daniel Hodges	c511b42b7b	scx_layered: Make verification easier on older kernels Refactor some BPF code to make verification easier on older kernels. This is to make it easier to maintain backports. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 08:05:10 -07:00
Daniel Hodges	12f8cb74b5	scx_utils: Add GPU topology Add GPU awareness to the topology crate. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-28 06:35:35 -07:00
Andrea Righi	28cb1ec5cb	scx_bpfland: enhanced task affinity Aggressively try to keep tasks running on the same CPU / cache / domain, to achieve higher performance when the system is not overcommitted. This is done by giving a second chance in ops.enqueue(), in addition to ops.select_cpu(), to find an idle CPU close to the previously used CPU. Moreover, even if the task is dispatched to the global DSQs, always try to check if there is an idle CPU in the primary domain that can immediately consume the task. = Results = This change seems to provide a minor, but consistent, boost of performance with the CPU-intensive benchmarks from the CachyOS benchmarks selection [1]. Similar results can also be noticed with some WebGL benchmarks [2], when system usage is close to its maximum capacity. Test: - cachyos-benchmarker System: - AMD Ryzen 7 5800X 8-Core Processor Metrics: - total time: elapsed time of all benchmarks - total score: geometric mean of all benchmarks NOTE: total time is the most relevant, since it gives a measure of the aggregate performance, while the total score puts more emphasis on performance consistency across all benchmarks. == Results: summary == Scheduler: Total Time (less = better), Total Score (less = better) - EEVDF: 624.44 sec, 123.68 - bpfland: 625.34 sec, 122.21 - bpfland-task-affinity: 623.67 sec, 122.27 == Conclusion == With this patch applied, bpfland shows both a better performance and consistency. Although the gains are small (less than 1%), they are still significant for this type of benchmark and consistently appear across multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker [2] https://webglsamples.org/aquarium/aquarium.html Tested-by: Piotr Gorski <piotr.gorski@cachyos.org> Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 10:30:54 +02:00
Avraham Hollander	6c5d85401d	Merge branch 'sched-ext:main' into main	2024-08-27 23:07:54 -04:00
Avraham Hollander	2a3cbeb760	scx_lavd: Add same power mode clarification to --no-prefer-turbo-core	2024-08-27 23:06:31 -04:00
Changwoo Min	5588126cff	scx_lavd: minor optimization for consume_task() When iterating neighbors, the existing code unnecessarily iterates up to the maximum number of neighbors even if there are no neighbors. So the fix escapes early when there are no neighbors. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:26:50 +09:00
Changwoo Min	95272ae910	scx_lavd: proper handling of ctrl-c in a monitoring mode Ctrl-c wasn't properly handled in the monitoring mode (`--monitor-sched-samples`), so the scheduler could not be terminated by pressing ctrl-c. The missing ctrl-c handling is added to the monitor thread. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:05:34 +09:00
Changwoo Min	9c4428fd8b	scx_lavd: remove unused rust functions Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-28 10:02:11 +09:00
Andrea Righi	a155d5185d	scx_bpfland: rely on Topology to classify core types Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs. Moreover, support the special keyword "all" with --primary-domain to include all the CPUs in the system (default). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:23:55 +02:00
Andrea Righi	872e653cd2	scx_utils: introduce Turbo core type to Topology Integrate the logic used by scx_bpfland to detect turbo-boosted cores in Topology. Also change the logic to detect Big/Little cores in function of base_frequency, instead of scaling_max_freq, otherwise turbo-boosted cores in homogeneous systems may be incorrectly classified as Big. Moreover, introduce the following new methods to Cpu to check for the core type: - is_turbo(): return true if the CPU is Turbo, false otherwise - is_big(): return true if the CPU is either Turbo or Big - is_little(): return true if the CPU is Little Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:09:08 +02:00
Daniel Hodges	41cebb807a	Merge pull request #569 from anh0516/main scx_layered: Clean up in-code documentation; add commas for consistency	2024-08-27 09:47:29 -04:00
Andrea Righi	6768f9f88c	Merge pull request #572 from sched-ext/bpfland-fix-turbo-domain scx_bpfland: fix turbo boost domain nullifying primary domain limits	2024-08-27 15:23:12 +02:00
Andrea Righi	e0f49a338a	scx_bpfland: fix turbo boost domain nullifying primary domain limits When creating the turbo boost scheduling domain, we might use a full CPU mask (selecting all possible CPUs) to indicate "do not prioritize turbo boost CPUs" or when all CPUs have the same maximum frequency. This approach works when the primary domain also contains all the CPUs, as the complete overlap allows the CPU selection logic to ignore the turbo boost domain and start picking CPUs directly from the primary domain. However, if the primary domain doesn't include all CPUs, the two domains won't fully overlap, which can lead to the turbo boost domain incorrectly including all CPUs, thereby negating the restrictions set by the primary scheduling domain. To resolve this, an empty CPU mask should be used for the turbo boost domain when turbo boost CPUs aren't prioritized. If the turbo boost domain is empty, it should be entirely bypassed, and the selection should proceed directly to the primary domain. Reported-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-27 13:36:50 +02:00
Changwoo Min	00430c3ded	scx_lavd: make a loop easier to correctly verify With an ill combination of old kernel and old LLVM, the BPF verifier incorrectly detects an infinite loop. After changing the loop with a constant end, the old verifier can pass the code. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-27 17:11:20 +09:00
Changwoo Min	09cff560aa	Merge pull request #566 from multics69/lavd-turbo scx_lavd: prioritize the turbo boost-able cores	2024-08-27 08:47:25 +09:00
Daniel Hodges	83cd26eb9e	Merge pull request #564 from hodgesds/layered-help scx_layered: Update help for tgid matching	2024-08-26 14:52:53 -04:00
Andrea Righi	35db89e90d	Merge pull request #568 from sched-ext/rustland-core-design-improv scx_rustland_core: small core design improvements	2024-08-26 20:06:21 +02:00
Avraham Hollander	7a43801d76	Add quotes for clarity	2024-08-26 13:20:01 -04:00
Avraham Hollander	0b6ebf826e	scx_lavd, scx_mitosis, scx_rusty: Add comma for grammatical consistency with the same change in the other schedulers	2024-08-26 13:06:58 -04:00
Avraham Hollander	07039f1f07	scx_layered: Documentation cleanup	2024-08-26 13:03:52 -04:00
Andrea Righi	1427d7d347	scx_rlfifo: enhance code design Refactor the code design to make it more suitable as a template for implementing advanced scheduling policies. In particular, create separate loops for task consumption and task dispatching. This will make the scheduler easier to adapt as a foundation for implementing more complex scheduling policies. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-26 16:10:54 +02:00
Daniel Hodges	c45c2de39f	scx_layered: Update help for tgid matching Forgot to add doc for tgid matching Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-26 07:06:21 -07:00
Changwoo Min	9807e561f0	scx_lavd: prioritize the turbo boost-able cores Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	cd5b2bf664	scx_lavd: replace nix signal handler to ctrlc Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	e887c56da0	scx_lavd: add "--version" option, which prints the current version Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:57:33 +09:00
Changwoo Min	0f97ca3066	scx_lavd: drop time slice calculation in ops.select_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 17:55:00 +09:00
Changwoo Min	4e3c36ca3f	scx_lavd: handle the missing cases in time slice calculation Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	be7d06e280	scx_lavd: make the old BPF verifier happy :-( Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	82f55b95b2	scx_lavd: add a fast path in pick_idle_cpu() when SMT is not activated Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	38779dbe8b	scx_lavd: improve pick_idle_cpu() Now it checks an active cpumask within a previous core's compute domain before checking the full active CPUs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	d1d9e97d08	scx_lavd: reduce LAVD_CPDOM_MAX_DIST to 4 The BPF verifier in the old kernel gives up analyzing the nested loop in consume_task(). We make the loop less complex by reducing LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy. Note that the theoretical maximum distance is 6 (numa > llc > core type) but there is no such hardware today, hence reducing it to 4 should be okay for the next few years, when hopefully the verifier becomes smarter. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	950710990f	scx_lavd: move time slice calculation to ops.enqueue() and ops.select_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	954b684a70	scx_lavd: update nr_queued_task every system stat update interval Updating nr_queued_task on every runqueue operation is expensive and unnecessary. So we update it every system stat update interval and use a moving average, which is accurate enough. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	4f906f1f49	scx_lavd: update README since it supports multi-CCX/NUMA Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	9551657b42	scx_lavd: prefer big cores in the performance mode Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	d4bb35e651	scx_lavd: use itertools::iproduct!() for a nested loop Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Changwoo Min	9368c6881d	scx_lavd: replace get_task_cpu_id() to scx_bpf_task_cpu() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-26 11:43:29 +09:00
Andrea Righi	a469f0f1ce	Merge pull request #561 from sched-ext/bpfland-fix-energy-profile-refresh scx_bpfland: prevent reading energy profile if not available	2024-08-25 18:31:34 +02:00
Tejun Heo	ca13e13ad6	Merge pull request #559 from sched-ext/htejun/cargo-workspace build: Use workspace to group rust sub-projects	2024-08-25 06:26:18 -10:00
Andrea Righi	f8acd069f0	scx_bpfland: prevent reading energy profile if not available Avoid periodically reading the current performance profile from /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if it's not available (e.g., with older CPUs or kernels without cpufreq). This fixes issue #560. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 16:53:35 +02:00
Andrea Righi	8853d9a9f2	Merge pull request #548 from sched-ext/rustland-core-refactoring scx_rustland_core: user-space framework refactoring	2024-08-25 16:39:28 +02:00
Tejun Heo	43950c65bd	build: Use workspace to group rust sub-projects meson build script was building each rust sub-project under rust/ and scheds/rust/ separately. This means that each rust project is built independently which leads to a couple problems - 1. There are a lot of shared dependencies but they have to be built over and over again for each project. 2. Concurrency management becomes sad - we either have to unleash multiple cargo builds at the same time possibly thrashing the system or build one by one. We've been trying to solve this from meson side in vain. Thankfully, in issue #546, @vimproved suggested using cargo workspace which makes the sub-projects share the same target directory and get built together by the same cargo instance while still allowing each project to behave independently for development and publishing purposes. Make the following changes: - Create two cargo workspaces - one under rust/, the other under scheds/rust/. Each contains all rust projects underneath it. - Don't let meson descend into rust/. These are libraries used by the rust schedulers. No need to build them from meson. Cargo will build them as needed. - Change the rust_scheds build target to invoke `cargo build` in scheds/rust/ and let cargo do its thing. - Remove per-scheduler meson.build files and instead generate custom_targets in scheds/rust/meson.build which invokes `cargo build -p $SCHED`. - This changes rust binary directory. Update README and meson-scripts/install_rust_user_scheds accordingly. - Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all schedulers now. - Unify .gitignore handling.
The following are build times on Ryzen 3975W: Before: Executed in 165.93 secs; usr time 40.55 mins (fish 2.71 millis, external 40.55 mins); sys time 3.34 mins (fish 36.40 millis, external 3.34 mins). After: Executed in 36.04 secs; usr time 336.42 secs (fish 0.00 millis, external 336.42 secs); sys time 36.65 secs (fish 43.95 millis, external 36.61 secs). Wallclock time is reduced 5x and CPU time 7x.	2024-08-25 00:47:58 -10:00
Andrea Righi	894f9582d0	scx_rustland_core: hide shutdown boilerplate in BpfScheduler Refactor the code to hide the shutdown handling inside BpfScheduler and simply use the exited() method to check when the scheduler is stopped. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 12:17:04 +02:00
Tejun Heo	152a8471cc	scx_bpfland: When reporting stats, use interval deltas Three of the reported stats are cumulative. While they obviously can be processed into delta values, that holds for the other direction too and the cumulative values are difficult to make intuitive sense of. Report interval delta values instead. Note that a stats client can reliably build back cumulative values even under heavy system contention - the delta values reported between two consecutive reads are guaranteed to be correct regardless of the duration of the interval.	2024-08-24 23:14:57 -10:00
Tejun Heo	bd68e230b9	scx_bpfland: Convert to scx_stats Use scx_stats instead of prometheus for stats reporting. This has a few advantages: - Stats metadata can be defined more succinctly. - Natural support for nesting statistics which will be useful in making scheduler components composable. - Support for multiple programmable readers where each reader can use their own reading interval. - Built-in stats help message generation. - Openmetrics integration is still available through scx_stats/scripts/scxstats_to_openmetrics.py.	2024-08-24 23:14:55 -10:00
Tejun Heo	625381280c	scx_stats: Shorten exported names and add prelude module Let's make it a bit easier to use: - Shorten exported names by changing the prefix from ScxStats to Stats. This should be distinctive enough and more inline with how most libraries name their exports. - Importing the right set of traits can be tricky. Introduce prelude module so that importing is a bit less painful.	2024-08-24 22:04:25 -10:00
Andrea Righi	a2e97fecbb	scx_rustland_core: merge verbose and debug in the same option There is no reason to have two separate options for "verbose" and "debug" mode. Just merge the two and always use "debug". If enabled, increase verbosity to stdout and enable reporting BPF scheduling events in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:45:20 +02:00
Andrea Righi	cb16a11342	scx_rustland_core: get rid of the global scheduler's slice_us Since scx_rustland_core enables setting a time slice on a per-task basis during task dispatch, there's no need to maintain a global time slice in the BPF component. Instead, a global time slice can simply be managed in user-space, achieving the same outcome. Therefore, drop the global slice_us property from BpfScheduler to simplify the API. NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be used by default. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:45:18 +02:00
Andrea Righi	e404bee5e7	scx_rustland / scx_rlfifo: small code format fixes Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:44:52 +02:00
Andrea Righi	1cd11ba916	scx_rlfifo: improve documentation and code readability Add more comments to make the source code more understandable, so that it can be easily used as a template for implementing more complex scheduling policies. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 09:44:28 +02:00
Tejun Heo	35a4326aee	scx_lavd: Drop unnecessary stat field explanation on startup The scheduling instance no longer prints out sched samples. No reason to print field explanation on startup.	2024-08-24 18:48:54 -10:00
Changwoo Min	02ad793c78	Merge branch 'main' into htejun/scx_lavd-stats	2024-08-25 11:57:41 +09:00
Changwoo Min	8b1874c27f	Merge pull request #552 from CachyOS/lavd-mutli-cxx2 scx_lavd: Drop message about unsupported multi-CXX support	2024-08-25 11:48:12 +09:00
Tejun Heo	fdfb7f60f4	Merge branch 'main' into htejun/scx_lavd-stats	2024-08-24 15:53:53 -10:00
Tejun Heo	55e5b8b43f	scx_lavd: Switch to scx_stats Scheduling sample reporting is switched to use scx_stats. This makes the scheduler run without making too much noise while still allowing monitoring on demand. It can also make introspection more dynamic - e.g. it shouldn't be difficult to add other monitoring commands which take scheduling samples based on different criteria or add other types of statistics. --nr_sched-samples is replaced with --monitor-nr-samples.	2024-08-24 15:53:02 -10:00
Tejun Heo	1bba713a29	Merge pull request #542 from sched-ext/htejun/scx_stats scx_stats, scx_rusty, scx_layered: Implement `--help-stats`	2024-08-24 15:38:36 -10:00
Peter Jung	906d054770	scx_lavd: Drop message about unsupported multi-CXX support Signed-off-by: Peter Jung <admin@ptr1337.dev>	2024-08-25 01:10:38 +02:00
Andrea Righi	0aa23481de	scx_rustland_core: drop update_tasks() and introduce notify_complete() The update_tasks() API is somewhat confusing, so replace it with a clearer API, notify_complete(). This new API will return control to the BPF component and inform it about the number of tasks still pending in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 00:45:23 +02:00
Daniel Hodges	e81faef103	Merge pull request #544 from hodgesds/layered-tgid scx_layered: Add layer match for tgid	2024-08-24 16:58:19 -04:00
Andrea Righi	5ece102554	scx_rustland: get rid of unnecessary debugging information Additional statistics will be re-added later via scx_stats. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	cef8ff8757	scx_rustland_core: get rid of the low_power API The low-power API is a bit of a hack implemented purely in the BPF layer, this should be better re-implemented with some concepts of topology awareness. Therefore, get rid of this API for now. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	be7ef1009b	scx_rlfifo: user-space idle CPU selection Select an idle CPU from user-space, instead of always dispatching on the first CPU available. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	568e292a24	scx_rustland_core: get rid of the exiting task API The current API used to notify the user-space scheduler when a task exits is really confusing (setting a negative value in queued_task_ctx.cpu), and it's also possible to detect task exiting events from user-space (or check in procfs, even if it's slower). In any case, a better API should be provided for this, so drop the current one for now. NOTE: this will cause additional memory usage for scx_rustland, but it can be fixed/addressed later in a separate commit (i.e., providing a periodic garbage collector for the unused task entries). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:29:10 +02:00
Andrea Righi	5d544ea264	scx_rustland_core: move CPU idle selection logic in user-space Allow user-space scheduler to pick an idle CPU via self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's select_cpu() interface. Also remove the full_user option and always rely on the idle selection logic from user-space. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 21:28:13 +02:00
Andrea Righi	1dd329dd7d	scx_rustland: update Cargo.lock Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 20:24:48 +02:00
Andrea Righi	106d59d997	scx_rlfifo: update Cargo.lock Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 20:24:48 +02:00
Andrea Righi	016aae759f	Merge pull request #545 from sched-ext/bpfland-honor-avg-nvcsw scx_bpfland: always honor average nvcsw in lowlatency mode	2024-08-24 20:24:33 +02:00
Avraham Hollander	66b5dd0de9	Clean up scx_rusty help info a bit	2024-08-24 11:56:12 -04:00
Avraham Hollander	c34a470024	scx_lavd: Fix my own formatting error	2024-08-24 11:36:19 -04:00
Andrea Righi	5a08855a86	scx_bpfland: always honor average nvcsw in lowlatency mode Keep evaluating the average number of voluntary context switches for each task when lowlatency mode is enabled, even when interactive tasks classification is disabled (via `-c 0`). The average nvcsw is also used in lowlatency mode to evaluate the proportional bonus to the tasks' deadline and it shouldn't be ignored when interactive tasks classification is disabled. Moreover, make sure that such bonus never exceeds the starvation threshold. Keep in mind that it is still possible to disable the periodic average nvcsw evaluation with `-c 0`, without specifying `--lowlatency`. Fixes: `6a22853` ("scx_bpfland: introduce --lowlatency option") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 10:42:22 +02:00
Tejun Heo	48092c6f88	scx_lavd: Relay introspection output in stats::TaskSample This indirection doesn't make any visible behavior difference now but will be used to implement scx_stats support.	2024-08-23 18:49:36 -10:00
Tejun Heo	725fa7f1be	Merge branch 'main' into htejun/scx_stats	2024-08-23 17:10:08 -10:00
Daniel Hodges	5a2012763e	scx_layered: Add layer match for tgid Add layer match for tgid. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-23 23:00:28 -04:00
Avraham Hollander	bedb18b48e	Improve scx_lavd help info A lot of scx_lavd's options do not clearly explain what they do. Add some short explanations, clean up the existing ones, and direct the user to read the in-code documentation for more info.	2024-08-23 18:56:14 -04:00
Avraham Hollander	d6e27b59e7	Clean up scx_bpfland help info a bit	2024-08-23 18:55:04 -04:00
Tejun Heo	25e437753c	scx_layered, scx_rusty: Implement --help-stats which shows all the defined stats. While at it, make some cosmetic updates.	2024-08-23 12:39:47 -10:00
Tejun Heo	405bcc63fe	scx_stats: Make ScxStatsServerData a public carrier of data needed for stats server And move related ops into it. This is a bit more natural and will also allow doing other operations (e.g. describing stats) without launching the server.	2024-08-23 12:23:57 -10:00
Tejun Heo	7bd35b6cd3	scx_lavd: Cargo.lock update (caused by scx_utils depending on scx_stats)	2024-08-23 09:21:44 -10:00
Andrea Righi	e72676ede3	Merge pull request #540 from sched-ext/bpfland-cpufreq-awareness scx_bpfland: cpu frequency and energy awareness	2024-08-23 21:17:34 +02:00
Tejun Heo	9e3b4e6db0	scx_stats: A bit of cleanups and renames	2024-08-23 09:09:02 -10:00
Tejun Heo	b6ccb87bec	Merge pull request #539 from sched-ext/htejun/scx_rusty scx_rusty: Convert to scx_stats	2024-08-23 08:42:47 -10:00
Daniel Hodges	7d45059fa9	Merge pull request #538 from hodgesds/layered-pid scx_layered: Add pid/ppid matches	2024-08-23 14:08:40 -04:00
Tejun Heo	8c8912ccea	Merge branch 'main' into htejun/scx_rusty	2024-08-23 07:50:23 -10:00
Andrea Righi	50684e4569	scx_bpfland: introduce Intel Turbo Boost awareness Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize them over the primary scheduling domain when the energy model `balance_power` is used (typically when running on battery power with the "balanced" profile). With this change the scheduling hierarchy becomes the following: 1) CPUs in the turbo scheduling domain 2) CPUs in the primary scheduling domain 3) full-idle SMT CPUs 4) CPUs in the same L2 cache 5) CPUs in the same L3 cache 6) CPUs in the task's allowed domain And the idle selection logic is modified as follows: - In the turbo scheduling domain: - pick same full-idle SMT CPU - pick any other full-idle SMT CPU sharing the same L2 cache - pick any other full-idle SMT CPU sharing the same L3 cache - pick any other full-idle SMT CPU - pick same idle CPU - pick any other idle CPU sharing the same L2 cache - pick any other idle CPU sharing the same L3 cache - pick any other idle SMT CPU - In the primary scheduling domain: - pick same full-idle SMT CPU - pick any other full-idle SMT CPU sharing the same L2 cache - pick any other full-idle SMT CPU sharing the same L3 cache - pick any other full-idle SMT CPU - pick same idle CPU - pick any other idle CPU sharing the same L2 cache - pick any other idle CPU sharing the same L3 cache - pick any other idle SMT CPU - In the entire task domain: - pick any other idle CPU Keep in mind that the turbo domain will be evaluated only when the scheduler is started with `--primary-domain auto` and only when the `balance_power` energy profile is used. The turbo domain is always made using the subset of CPUs in the system with the highest max frequency. If such subset can't be determined (for example if all the CPUs in the primary domain have the same frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help to improve performance by forcing the governor to scale up their frequency, without increasing power consumption too much, because tasks will preferably be confined to a reduced number of cores. This change seems to improve performance, without increasing power consumption much, on Intel laptops while using the `balance_power` energy profile. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:08 +02:00
Andrea Righi	d958dd4482	scx_bpfland: introduce dynamic energy profile Introduce the new option `--primary-domain auto`. With this option the scheduler will dynamically adjust the primary scheduling domain at run-time, based on the current energy profile reported in /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference. When the `power` energy profile is selected, the primary scheduling domain will prioritize E-cores. Alternatively, when the `performance` profile is selected, it will prioritize P-cores. For all the other energy profiles, all the CPUs in the system will be used. Note that this option is only relevant on hybrid architectures with P-cores and E-cores. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:01 +02:00
Tejun Heo	44a0f1b124	scx_utils: Factor out monitor_stats() from scx_rusty and scx_layered	2024-08-23 06:46:19 -10:00
Tejun Heo	ae3024e938	scx_layered: Add --stats and make --monitor behavior consistent with scx_rusty	2024-08-23 05:52:52 -10:00
Tejun Heo	0f04a93dd1	scx_rusty: Add stat descriptions and make minor adjustments	2024-08-23 05:46:13 -10:00
Tejun Heo	36865234f8	scx_rusty: Add scx_stats annotations necessary for openmetrics translation	2024-08-23 04:59:08 -10:00
Tejun Heo	2f3f473cd3	scx_rusty: Improve timestamp reporting	2024-08-23 04:31:27 -10:00
Daniel Hodges	11b978a892	scx_layered: Add pid/ppid matches Add matches for pid/ppid. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-23 07:20:05 -07:00
Tejun Heo	76934f3aab	scx_rusty: Convert to scx_stats This allows scx_rusty to avoid generating excessive logs for statistics while still allowing detailed monitoring on demand.	2024-08-22 19:44:12 -10:00
Tejun Heo	16c07a5cd9	scx_rusty: Don't reset bpf_stats, remember prev states and calculate delta This will ease transition to scx_stats.	2024-08-22 13:02:23 -10:00
Tejun Heo	13fa48a871	scx_rusty: Separate out stats generation and formatting to prepare for scx_stats conversion.	2024-08-22 10:03:10 -10:00
Tejun Heo	b4564520e5	scx_rusty: Simplify Stats structs and take id out of the structs to prepare for scx_stats conversion. While at it, make some cosmetic changes.	2024-08-22 08:45:33 -10:00
Andrea Righi	6a2285398d	scx_bpfland: introduce --lowlatency option Introduce the new `--lowlatency` option, which enables switching between the default pure vruntime-based scheduling (more optimized for server workloads) and a deadline-based scheduling (better suited for low-latency workloads). When the low-latency mode is activated, a task's deadline is calculated as its vruntime, adjusted by a bonus proportional to the task's average number of voluntary context switches (the more voluntary context switches, the shorter the deadline). This feature enhances the prioritization of interactive tasks even more, proportionally to their average voluntary context switches, also within the two main global queues (priority / shared) and it helps keep interactive workloads responsive, even in the presence of heavy non-interactive background work. Low-latency mode helps prevent audio crackling even in the presence of a large number of short-lived tasks with pseudo-interactive behavior (e.g., hackbench) and it enables achieving approximately a +33% average frames-per-second (FPS) in the typical "gaming while building the kernel" benchmark. However, it can also amplify the de-prioritization of CPU-intensive tasks, making this option more suitable for specific low-latency scenarios. Therefore the low-latency mode is disabled by default and it can only be enabled via the `--lowlatency` option. Tested-by: Piotr Gorski (piotrgorski@cachyos.org) Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 13:26:19 +02:00
Tejun Heo	4834dec684	scx_rusty: Move stats structs to stats.rs and rename for consistency	2024-08-21 22:04:38 -10:00
Andrea Righi	b0a8e4a91e	scx_bpfland: better time slice control Explicitly replenish the task's time slice from ops.dispatch() if the task still wants to run and no other task is selected. In this way the sched_ext core won't automatically re-schedule the task on the same CPU, implicitly assigning a time slice of SCX_SLICE_DFL. Moreover, instead of determining the task time slice in ops.enqueue(), refresh the time slice immediately before the task is started on its assigned CPU in ops.running(). This allows to use a more precise time slice, adjusted based on the actual amount of tasks that are currently waiting to be scheduled. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 09:23:37 +02:00
Tejun Heo	d6ac5fbd9c	scx_layered: Drop SCX_OPS_ENQ_LAST The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and enqueueing on local DSQ will no longer be sufficient to avoid stalls. No reason to do it anyway. Just drop it.	2024-08-21 13:13:59 -10:00
Tejun Heo	f726f0b73b	Version: Cargo.lock	2024-08-21 06:45:19 -10:00
Tejun Heo	4d1f0639d8	Version: v1.0.3	2024-08-21 06:42:11 -10:00
Andrea Righi	fedfee0bd6	scx_bpfland: drop unused variable With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in init_primary_domain(), so drop the variable to fix the following build warning: warning: unused variable: `topo` --> src/main.rs:385:9 \| 385 \| topo: &Topology, \| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo` \| = note: `#[warn(unused_variables)]` on by default Fixes: `1da249f` ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 17:46:12 +02:00
Andrea Righi	9f7a11bba6	Merge pull request #528 from sched-ext/bpfland-turbo-boost scx_bpfland: properly classify Intel Turbo Boost CPUs	2024-08-21 17:40:25 +02:00
Daniel Hodges	f2a6661a85	Merge pull request #524 from hodgesds/layered-core-fixes scx_layered: Fix core selection	2024-08-21 08:13:33 -04:00
Tejun Heo	9c62019c81	Merge pull request #527 from sched-ext/htejun/scx_utils scx_utils::cpumask,topology: Misc updates	2024-08-20 22:25:25 -10:00
Andrea Righi	695e3b25b0	scx_bpfland: classify CPUs depending on their base frequency Use the base frequency, instead of maximum frequency, to classify fast and slow CPUs. This ensures accurate distinction between Intel Turbo Boost CPUs and genuinely faster CPUs when auto-detecting the primary scheduling domain. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 10:16:41 +02:00
Andrea Righi	e0fb99835d	Merge pull request #525 from sched-ext/bpfland-disable-interactive scx_bpfland: allow to completely disable interactive classification	2024-08-21 10:02:43 +02:00
Tejun Heo	5cf4212330	Revert "rusty: Integrate stats with the metrics framework" This reverts commit `83373b1f4e` in preparation for converting to scx_stats.	2024-08-20 21:59:25 -10:00
Tejun Heo	516a7590db	scx_rusty: Revert log_recorder conversion scx_rusty will be converted to scx_stats in a similar fashion to scx_layered. Undo log_recorder conversion in preparation.	2024-08-20 21:59:20 -10:00
Tejun Heo	1da249f063	scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE Always use the LazyLock versions and drop the counterparts from Topology.	2024-08-20 21:57:56 -10:00
Tejun Heo	092f5422d6	Merge pull request #518 from sched-ext/htejun/misc scx_layered: Add `--run-example` and enable CI testing	2024-08-20 21:42:45 -10:00
Tejun Heo	f7c193e528	scx_utils, scx_rusty: Minor updates to version handling - Update scx_utils/build.rs so that 12 char SHA1 is generated instead of full one. - Add --version to scx_rusty. Use custom one as we don't want to use the default cargo version one.	2024-08-20 21:03:05 -10:00
Tejun Heo	8f786be08f	scx_rusty: cargo fmt	2024-08-20 21:03:05 -10:00
Tejun Heo	4440567949	scx_rusty: Update Cargo.lock	2024-08-20 21:03:05 -10:00
Andrea Righi	c85315d527	scx_bpfland: allow to completely disable interactive classification Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as interactive. However, if interactive tasks classification is disabled (via `-c 0`), we should avoid promoting them as interactive. This is particularly important because, with the nvcsw logic disabled, tasks can remain classified as interactive indefinitely and they will never be demoted to regular tasks. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 08:45:13 +02:00
Andrea Righi	a9f5aaa536	scx_bpfland: replace custom CpuMask with scx_utils::Cpumask Rely on scx_utils::Cpumask instead of re-implementing a custom struct to parse and manage CPU masks. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 07:21:52 +02:00
Daniel Hodges	4d1c932619	scx_layered: Fix core selection Fix a bug introduced in #510 where it assumed core ids are incremental. This refactors the core ordering for layers to be far more simple and provide some space for layer core isolation in low utilization. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-20 19:26:53 -07:00
Andrea Righi	33b6ada98e	Merge pull request #509 from sched-ext/bpfland-topology scx_bpfland: topology awareness	2024-08-20 14:37:23 +02:00
Andrea Righi	467d4b5ea4	scx_bpfland: get topology information from scx_utils::Topology Rely on scx_utils::Topology to get CPU and cache information, instead of re-implementing custom methods. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-20 10:16:02 +02:00
Tejun Heo	c0418250f4	scx_layered: Add --run-example option So that scx_layered can be run in CI environment in a single command.	2024-08-19 20:50:10 -10:00
Changwoo Min	41bc6f0967	Merge pull request #511 from multics69/lavd-perf-profile scx_lavd: add power profile options: --performance, --balanced, --powersave	2024-08-20 09:02:37 +09:00
Changwoo Min	1d61dd4c1d	Merge pull request #508 from multics69/lavd-numa-fix scx_lavd: fix a potential watchdog timeout error at multi-NUMA/CCX platforms	2024-08-20 09:02:23 +09:00
Changwoo Min	2c4c2a0ccf	Merge pull request #507 from multics69/lavd-pretty-rust scx_lavd: revise FlatTopology prettier	2024-08-20 09:01:26 +09:00
Daniel Hodges	05a2721f8e	Merge pull request #510 from hodgesds/layered-core-topo-selection scx_layered: Use topology for core selection	2024-08-19 20:01:16 -04:00
Tejun Heo	d01b49bd0e	scx_layered: Fix verification failure `4fccc06905` ("scx_layered: Fix uninitialized variable") causes the following verification failure. Fix it by moving assignments below range checking. Validating match_layer() func#1... 283: R1=scalar() R2=scalar() R3=mem_or_null(id=49,sz=1) R10=fp0 ; int match_layer(u32 layer_id, pid_t pid, const char cgrp_path) @ main.bpf.c:1029 283: (7b) (u64 )(r10 -24) = r3 ; R3=mem_or_null(id=49,sz=1) R10=fp0 fp-24_w=mem_or_null(id=49,sz=1) 284: (bc) w7 = w1 ; R1=scalar() R7_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) ; struct layer layer = &layers[layer_id]; @ main.bpf.c:1033 285: (bc) w1 = w7 ; R1_w=scalar(id=50,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R7_w=scalar(id=50,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 286: (27) r1 = 1061192 ; R1_w=scalar(smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) 287: (18) r8 = 0xffffc90002a26000 ; R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080) 289: (0f) r8 += r1 ; R1_w=scalar(smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) ; u32 nr_match_ors = layer->nr_match_ors; @ main.bpf.c:1034 290: (bf) r1 = r8 ; R1_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) 291: (07) r1 += 1060992 ; R1_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,off=0x103080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) 292: (61) r1 = (u32 *)(r1 +0) R1 unbounded memory access, make sure to bounds check any such 
access processed 1099 insns (limit 1000000) max_states_per_insn 2 total_states 72 peak_states 72 mark_read 9 -- END PROG LOAD LOG --	2024-08-19 13:18:20 -10:00
Daniel Hodges	b3793e0069	scx_layered: Use topology for core selection Currently the core selection logic in scx_layered uses the first available core in the bitmask. This is suboptimal when the scheduler is configured with specific NUMA/LLC restrictions. The ideal core selection logic should try to find the least used cores within the preferred scheduling domain and allocate new cpus from shared cores within that domain. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-19 15:51:35 -07:00
Tejun Heo	3498a2b899	Merge pull request #514 from sched-ext/htejun/scx_stats scx_stats, scx_layered: Implement independent stats client sessions	2024-08-19 11:24:53 -10:00
Tejun Heo	f6bc52d31e	scx_layered: Make --monitor behavior more useful - If --monitor is specified with layer specs, the scheduler also starts stats monitoring on a thread. - Standalone monitoring mode no longer exits when the scheduler isn't there.	2024-08-19 10:55:02 -10:00
Tejun Heo	d03e48eb75	scx_layered: Implement per-stats-client nr_layer_cpus_ranges tracking With this, every client sees the correct nr_layer_cpus_ranges without interfering with each other.	2024-08-19 09:12:51 -10:00
Tejun Heo	448aacfd60	scx_layered: Initialize Stats.prev_layer_cycles properly on new() So that new stats session doesn't start with an inflated utilization number.	2024-08-19 08:40:40 -10:00
Tejun Heo	25d7e6f787	scx_layered: Implement on-demand statistics generation Instead of keeping one copy of sched_stats, each stats server session carries their own so that stats can be generated independently by each client at any interval. CPU allocation min/max tracking is broken for now.	2024-08-19 08:27:36 -10:00
Andrea Righi	f8a2445869	scx_bpfland: introduce performance/powersave primary domain The primary scheduling domain represents a group of CPUs in the system where the scheduler will initially attempt to assign tasks. Tasks will only be dispatched to CPUs within this primary domain until they are fully utilized, after which tasks may overflow to other available CPUs. The primary scheduling domain can be defined using the option `--primary-domain CPUMASK` (by default all the CPUs in the system are used as primary domain). This change introduces two new special values for the CPUMASK argument: - `performance`: automatically detect the fastest CPUs in the system and use them as primary scheduling domain, - `powersave`: automatically detect the slowest CPUs in the system and use them as primary scheduling domain. The current logic only supports creating two groups: fast and slow CPUs. The fast CPU group is created by excluding CPUs with the lowest frequency from the overall set, which means that within the fast CPU group, CPUs may have different maximum frequencies. When using the `performance` mode the fast CPUs will be used as primary domain, whereas in `powersave` mode, the slow CPUs will be used instead. This option is particularly useful in hybrid architectures (with P-cores and E-cores), as it allows the use of bpfland to prioritize task scheduling on either P-cores or E-cores, depending on the desired performance profile. 
Example: - Dell Precision 5480 - CPU: 13th Gen Intel(R) Core(TM) i7-13800H - P-cores: 0-11 / max freq: 5.2GHz - E-cores: 12-19 / max freq: 4.0GHz $ scx_bpfland --primary-domain performance 0[\|\|\|\|\|\|\|\|\| 24.5%] 10[\|\|\|\|\|\|\|\| 22.8%] 1[\|\|\|\|\|\| 14.9%] 11[\|\|\|\|\|\|\|\|\|\|\|\|\| 36.9%] 2[\|\|\|\|\|\| 16.2%] 12[ 0.0%] 3[\|\|\|\|\|\|\|\|\| 25.3%] 13[ 0.0%] 4[\|\|\|\|\|\|\|\|\|\|\| 33.3%] 14[ 0.0%] 5[\|\|\|\| 9.9%] 15[ 0.0%] 6[\|\|\|\|\|\|\|\|\|\|\| 31.5%] 16[ 0.0%] 7[\|\|\|\|\|\|\| 17.4%] 17[ 0.0%] 8[\|\|\|\|\|\|\|\| 23.4%] 18[ 0.0%] 9[\|\|\|\|\|\|\|\|\| 26.1%] 19[ 0.0%] Avg power consumption: 3.29W $ scx_bpfland --primary-domain powersave 0[\| 2.5%] 10[ 0.0%] 1[ 0.0%] 11[ 0.0%] 2[ 0.0%] 12[\|\|\|\| 8.0%] 3[ 0.0%] 13[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| 64.2%] 4[ 0.0%] 14[\|\|\|\|\|\|\|\|\|\| 29.6%] 5[ 0.0%] 15[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| 52.5%] 6[ 0.0%] 16[\|\|\|\|\|\|\|\|\| 24.7%] 7[ 0.0%] 17[\|\|\|\|\|\|\|\|\|\| 30.4%] 8[ 0.0%] 18[\|\|\|\|\|\|\| 22.4%] 9[ 0.0%] 19[\|\|\|\|\| 12.4%] Avg power consumption: 2.17W (Info collected from htop and turbostat) Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
Andrea Righi	174993f9d2	scx_bpfland: introduce cache awareness While the system is not saturated the scheduler will use the following strategy to select the next CPU for a task: - pick the same CPU if it's a full-idle SMT core - pick any full-idle SMT core in the primary scheduling group that shares the same L2 cache - pick any full-idle SMT core in the primary scheduling group that shares the same L3 cache - pick the same CPU (ignoring SMT) - pick any idle CPU in the primary scheduling group that shares the same L2 cache - pick any idle CPU in the primary scheduling group that shares the same L3 cache - pick any idle CPU in the system While the system is completely saturated (no idle CPUs available), tasks will be dispatched on the first CPU that becomes available. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
Tejun Heo	27c530e17e	scx_stats: Add missing trait exports	2024-08-19 07:16:43 -10:00
Tejun Heo	0cf5ca605d	scx_layered: Move processing_dur accounting into Stats and protect it with Arc<Mutex<>>	2024-08-19 06:25:23 -10:00
Tejun Heo	a77fe372d6	scx_stats: Make server shutdown when connection is dropped and add communication channel This will make implementing connection sessions easier where each stats client connection maintains a set of states.	2024-08-19 06:23:16 -10:00
Changwoo Min	832f194845	scx_lavd: add power profile options: --performance, --powersave, --balanced Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-19 19:03:51 +09:00
Changwoo Min	c4c157f91c	scx_lavd: add "--prefer-little-core" option This option chooses little (efficiency) cores over big (performance) cores to save power consumption for core compaction. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-19 18:23:35 +09:00
Changwoo Min	73b873827d	scx_lavd: merge put_cpdom_rq() to ops.enqueue() Clean up and reorganize the code around ops.enqueue(). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-19 14:22:03 +09:00
Changwoo Min	9475ace336	scx_lavd: always enqueue to a DSQ in task's compute domain Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-19 14:07:56 +09:00
Changwoo Min	0656c3232e	scx_lavd: revise FlatTopology prettier The changes include 1) chopping down a big function into smaller ones for readability and maintainability and 2) using the interior mutability pattern (Cell and RefCell) to avoid unnecessary clone() calls. There are no functional changes. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-19 11:03:52 +09:00
I Hsin Cheng	4fccc06905	scx_layered: Fix uninitialized variable Fix the uninitialized variable "layer" in the function match_layer which caused the compiling process to fail. "layer" is supposed to be the same as "&layers[layer_id]". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-08-17 23:32:53 +08:00
Tejun Heo	3a688cfde7	scx_stats: Add support for no-value user attributes and a bunch of other changes - Allow no-value user attributes which are automatically assigned "true" when specified. - Make "top" attribute string "true" instead of bool true for consistency. Testing for existence is always enough for value-less attributes. - Don't drop leading "_" from user attribute names when storing in dicts. Dropping makes things more confusing. - Add "_om_skip" to scx_layered fields which don't jive well with OM. scxstats_to_openmetrics.py is updated accordingly and no longer generates warnings on those fields. - Examples and README updated accordingly.	2024-08-16 07:52:02 -10:00
I Hsin Cheng	5d85937842	scx_rusty: Fix typo Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-08-16 22:03:59 +08:00
Tejun Heo	c16b48d7b2	scheds/rust: Include Cargo.lock in the repo Binary packages are expected to include Cargo.lock in the repo so that the produced binaries match across different builds.	2024-08-15 23:08:35 -10:00
Tejun Heo	22167aeb14	Merge pull request #502 from sched-ext/htejun/scx_stats scx_stats: Refine scx_stats and implement scxstats_to_openmetrics.py	2024-08-15 22:55:11 -10:00
Tejun Heo	570ca56c57	scx_layered: s/_om_field_prefix/_om_prefix/	2024-08-15 21:29:58 -10:00
Tejun Heo	af01dd19ec	Merge pull request #500 from sched-ext/htejun/scx_stats scx_stats, scx_layered: Add `om_prefix` attribute and fix s/stat/stats/ stragglers	2024-08-15 21:27:38 -10:00
Tejun Heo	ea453e51d3	scx_stats: Rename "all" attribute to "top" and clean up examples a bit	2024-08-15 21:24:55 -10:00
Tejun Heo	a910fa451a	scx_layered: Add _om attributes to LayerStats for OpenMetrics piping	2024-08-15 19:11:49 -10:00
Tejun Heo	6a5d6f7c27	scx_stats: Replace field_prefix attribute with '_' prefixed user attributes	2024-08-15 19:09:59 -10:00
Tejun Heo	a9922deaa2	scx_stats: Add "all" attribute and rename metadata type strings	2024-08-15 14:50:00 -10:00
Tejun Heo	ebc1a89c34	scx_stats: s/stat/stats/ stragglers	2024-08-15 14:00:00 -10:00
Tejun Heo	bafd67b568	scx_stats: Fix parsing for multiple stat attributes The code was assuming single attribute per #[stat()] block. Update it so that there can be multiple comma separated attributes in a single block.	2024-08-15 13:46:20 -10:00
Tejun Heo	8f361af077	scx_layered: Shorten stat field descriptions	2024-08-15 13:25:48 -10:00
Tejun Heo	1912e05f0b	Merge pull request #499 from sched-ext/htejun/scx_stats scx_stats: Misc changes to sync dep versions and publish on crates.io	2024-08-15 12:32:44 -10:00
Tejun Heo	0b9c8b5cbd	scx_stats: Update versions to 0.2.0 to republish	2024-08-15 12:29:27 -10:00
Daniel Hodges	0319afc88e	scx_layered: Update nr_cpus when resizing layers After updating scx_layered to be topology aware the nr_cpus field on the layer was not being updated properly. Update layer growing/shrinking logic to correctly update the nr_cpus count. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-15 13:22:26 -07:00
Tejun Heo	cc73b6a826	Merge pull request #496 from sched-ext/htejun/scx_stat scx_stat: Initial commit	2024-08-15 09:24:55 -10:00
Tejun Heo	b614cf848f	scx_layered: Make monitor time based iterations dumber This makes ctrl-c a bit more responsive without complicating code.	2024-08-15 09:23:29 -10:00
Tejun Heo	45fb724ee2	scx_layered: Restore cpumask reporting	2024-08-15 09:12:29 -10:00
Tejun Heo	751a38e34e	scx_layered: Refactor stats printing code	2024-08-15 08:53:19 -10:00
Tejun Heo	a4f424056e	scx_layered: Move stats server launching to stats.rs	2024-08-15 06:30:42 -10:00
Tejun Heo	17afc72479	scx_stats: Rename cleanups - s/stat/stats/ on several stragglers. - Rename traits so that they are more distinctive from struct and other names and follow the convention.	2024-08-15 06:24:56 -10:00
Tejun Heo	a091d5ea7d	scx_layered: s/monitor.rs/stats.rs/ and make stats refresh code struct ops	2024-08-15 06:13:05 -10:00
Tejun Heo	8aae9a5de2	scx_stats: s/scx_stat/scx_stats/ Use plural form which is more widespread and also used in scheduler implementations. No functional changes.	2024-08-15 05:31:34 -10:00
Tejun Heo	6e466d18df	scx_layered: Initial switch to scx_stat - This makes the scheduler side simpler and allows on-demand monitoring. - OpenMetrics support is dropped for now. Will add a generic tool for it. - This is a naive conversion. Will be further refined. scx_layered no longer prints statistics by default. To watch statistics, run `scx_layered --monitor` while the scheduler is running.	2024-08-14 13:48:41 -10:00
Tejun Heo	7820ec9b46	scx_stat, scx_layered: cargo fmt	2024-08-14 11:47:37 -10:00
Tejun Heo	099b6c266a	scx_lavd: Build fix Add "signal" feature to nix dependency; otherwise, build fails.	2024-08-14 07:55:04 -10:00
Andrea Righi	0f018c5fff	Merge pull request #484 from vax-r/rustland_unused scx: Remove unused variables, imports and functions	2024-08-14 19:03:26 +02:00
Andrea Righi	f9a994412d	scx_bpfland: introduce primary scheduling domain Allow to specify a primary scheduling domain via the new command line option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of arbitrary length, representing the CPUs assigned to the domain. If this option is not specified the scheduler will use all the available CPUs in the system as primary domain (no behavior change). Otherwise, if a primary scheduling domain is defined, the scheduler will try to dispatch tasks only to the CPUs assigned to the primary domain, until these CPUs are saturated, at which point tasks may overflow to other available CPUs. This feature can be used to prioritize certain cores over others and it can be really effective in systems with heterogeneous cores (e.g., hybrid systems with P-cores and E-cores). == Example (hybrid architecture) == Hardware: - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H - 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11) - 8 E-cores 6..13 with 1 CPU each (CPU from 12..19) == Test == WebGL application (https://webglsamples.org/aquarium/aquarium.html): this allows to generate a steady workload in the system without over-saturating the CPUs. Use different scheduler configurations: - EEVDF (default) - scx_bpfland using P-cores only (--primary-domain 0x00fff) - scx_bpfland using E-cores only (--primary-domain 0xff000) Measure performance (fps) and power consumption (W). 
== Result == +-----+-----+------+-----+----------+ \| min \| max \| avg \| \| \| \| fps \| fps \| fps \| stdev \| power \| +-----------------+-----+-----+------+-------+--------+ \| EEVDF \| 28 \| 34 \| 31.0 \| 1.73 \| 3.5W \| \| bpfland-p-cores \| 33 \| 34 \| 33.5 \| 0.29 \| 3.5W \| \| bpfland-e-cores \| 25 \| 26 \| 25.5 \| 0.29 \| 2.2W \| +-----------------+-----+-----+------+-------+--------+ Using a primary scheduling domain of only P-cores with scx_bpfland allows to achieve a more stable and predictable level of performance, with an average of 33.5 fps and an error of ±0.5 fps. In contrast, using EEVDF results in an average frame rate of 31.0 fps with an error of ±3.0 fps, indicating slightly less consistency, due to the fact that tasks are evenly distributed across all the cores in the system (both slow and fast cores). On the other hand, using a scheduling domain solely of E-cores with scx_bpfland results in a lower average frame rate (25.5 fps), though it maintains a stable performance (error of ±0.5 fps), but the power consumption is also reduced, averaging 2.2W, compared to 3.5W with either of the other configurations. == Conclusion == In summary, with this change users have the flexibility to prioritize scheduling on performance cores for better performance and consistency, or prioritize energy efficient cores for reduced power consumption, on hybrid architectures. Moreover, this feature can also be used to minimize the number of cores used by the scheduler, until they reach full capacity. This capability can be useful for reducing power consumption even in homogeneous systems or for conducting scheduling experiments with smaller sets of cores, provided the system is not overcommitted. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Andrea Righi	a6e977c70b	scx_bpfland: make output more compact Abbreviate the statistics reported to stdout and remove the slice_ms metric: this metric can be easily derived from slice_ns, slice_ns_min and nr_wait, which is already reported to stdout. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Andrea Righi	8656effa50	scx_bpfland: update copyright info Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Changwoo Min	3c6d86b342	scx_lavd: upgrade nix package from 0.28.0 to 0.29.0 Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-14 22:31:05 +09:00
Changwoo Min	444f0b86a5	Merge pull request #489 from multics69/lavd-amp-v4 lavd: make LAVD core-type (AMP) aware	2024-08-14 14:24:09 +09:00
Tejun Heo	4612764b82	Merge pull request #486 from vax-r/Fix_rusty_logic scx_rusty: Fix logical error when filtering tasks	2024-08-13 09:39:12 -10:00
Daniel Hodges	646cefd46d	Merge pull request #477 from hodgesds/layered-global-match scx_rusty: Make layer matching a global function	2024-08-12 09:14:58 -04:00
Daniel Hodges	be5213e129	scx_rusty: Make layer matching a global function Layer matching currently takes a large number of bpf instructions. Moving layer matching to a global function will reduce the overall instruction count and allow for other layer matching methods such as glob. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-12 05:44:34 -07:00
Changwoo Min	b7b8c8de90	scx_lavd: fix build errors Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 14:10:40 +09:00
Changwoo Min	182b0bd249	scx_lavd: make the verifier in 6.8 kernel happy Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:04:04 +09:00
Changwoo Min	4ecf3fc94e	scx_lavd: build cpdom map from rust Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:03:18 +09:00
Changwoo Min	1f1a3dc4f1	scx_lavd: sort cores in descending order of max freq Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	c213a3e44f	scx_lavd: make core compaction core type aware Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	c35b6b27ff	scx_lavd: consider task pinning for core-type-aware ops.enqueue() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	25bf98d2a0	scx_lavd: make ops.select_cpu() core type aware Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	fa87e1c593	scx_lavd: make ops.dispatch() core type aware Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	c1cf11f7b1	scx_lavd: make ops.enqueue() core type aware Put a performance-critical task to a performance critical queue and a regular task to a regular queue. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	03a8c10ece	scx_lavd: add cpdom_ctx to abstract compute domain and its DSQ Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	623b05a282	scx_lavd: revise perf_cri factor to reflect wakeup, runtime, and run_freq Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	15871fd032	scx_lavd: turn off pinned core less aggressively Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	9dc7f94cb6	scx_lavd: unify the deadline calculation and ineligibility calculation The unified version is not only simpler but also works better. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:40 +09:00
Changwoo Min	4705520d40	scx_lavd: remove unnecessary options which have never been used Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-12 13:01:34 +09:00
I Hsin Cheng	15b40de408	scx_rusty: Fix logical error when filtering tasks The task filtering logic was moved from find_first_candidate() into a vector filter operation in commit `1c3b563`. However, the condition was not negated in the move: .filter() keeps the tasks we want, whereas .skip_while() was throwing unwanted tasks out. That's why the logic here should be reversed, so that we won't take kworker or migrated tasks into consideration. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-08-10 22:56:20 +08:00
I Hsin Cheng	4e40ba3b11	scx_rustland: Removed unused imports and variables The member "topo_map" in Scheduler is never used and thus should be removed, the related imports are removed as well. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-08-09 20:35:12 +08:00
I Hsin Cheng	b7e03b7a76	scx_bpfland: Remove unused variable Remove unused variable "vtime" in task_vtime(). Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-08-09 20:28:42 +08:00
Tejun Heo	45f7fd13b7	versions: Synchronize crate dependency versions	2024-08-08 14:45:46 -10:00
Tejun Heo	63c4a0191f	Merge branch 'main' into topic/inlined-skeleton-members	2024-08-08 14:23:37 -10:00
Tejun Heo	cd6a4d72c7	Bump versions for 1.0.2 release	2024-08-08 14:10:16 -10:00
Tejun Heo	7c3ffe96e1	Unify crate dependency versions Different sub-projects are using different versions for the same crates. Synchronize them to the latest.	2024-08-08 13:26:47 -10:00
Andrea Righi	9d808ae206	Merge pull request #468 from sched-ext/rustland-refactoring scx_rustland refactoring	2024-08-07 11:38:21 +02:00
Andrea Righi	51cfb69199	scx_rustland_core: re-introduce partial mode Re-add the partial mode option that was dropped during the refactoring. The partial option allows to apply the scheduler only to the tasks which have their scheduling policy set to SCHED_EXT via sched_setscheduler(). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:41:06 +02:00
Andrea Righi	e1f2b3822e	scx_rustland_core: drop CPU ownership API The API for determining which PID is running on a specific CPU is racy and is unnecessary since this information can be obtained from user space. Additionally, it's not reliable for identifying idle CPUs. Therefore, it's better to remove this API and, in the future, provide a cpumask alternative that can export the idle state of the CPUs to user space. As a consequence also change scx_rustland to dispatch one task at a time, instead of dispatching tasks in batches of idle cores (that are usually not accurate due to the racy nature of the CPU ownership interface). Dispatching one task at a time even makes the scheduler more performant, due to the vruntime scheduling being applied to more tasks sitting in the scheduler's queue. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:41:06 +02:00
Andrea Righi	9a0e7755df	scx_rustland_core: export counter of online CPUs Introduce a helper to get the amount of online CPUs tracked by the BPF part. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	d9c9f78e3e	scx_rustland: re-align vruntime and time slice evaluation to scx_bpfland Drop the slice boost logic and apply a vruntime and task time slice evaluation approach similar to scx_bpfland (but implement this in the user-space component instead of the BPF part). Additionally, introduce a slice_us_min parameter to define the minimum time slice that can be assigned to a task, also similar to scx_bpfland. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	38a725ea34	scx_rlfifo: update copyright info Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	c963d5eb05	scx_rustland: update copyright info Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	b87541a26e	scx_rustland_core: refactor idle CPU selection logic Use the same idle selection logic used in scx_bpfland also in scx_rustland_core. Also drop fifo_mode and always use the BPF idle selection logic by default as long as the system is not saturated, unless full_user is specified. This approach allows user-space schedulers aiming for maximum performance to leverage the BPF idle selection logic (bypassing user-space), while those seeking full control can enable full_user to bypass the BPF CPU idle selection logic and choose the target CPU for each task from user-space. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	d8985306f4	scx_rustland: user-space interactive task classifier We don't need to send the number of voluntary context switches (nvcsw) from BPF to user-space, as this information is already accessible in user-space via procfs. Sending this data would only create unnecessary overhead for schedulers that don't require it, and those that do can easily retrieve it through procfs. Therefore, drop this metric from scx_rustland_core and change scx_rustland to implement an interactive task classifier fully in the user-space part of the scheduler. Also drop some options that do not provide any significant benefit (also in preparation of a bigger refactoring to define a better API for the user-space framework). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-06 17:56:58 +02:00
Daniel Hodges	d5efcd3245	scx_layered: Fix cred declaration The use of the cred struct should be const. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-06 05:22:12 -07:00
Tejun Heo	b226865b96	scx_lavd: Make FlatTopology::new() a bit prettier - Use .enumerate() consistently while building the cpu_fids vector. - Use .then_with() to chain .cmp() when sorting cpu_fids. Both reduce visual clutter.	2024-08-04 11:16:19 -10:00
Changwoo Min	130ea97fbf	Merge pull request #464 from multics69/lavd-amp-v3 scx_lavd: improve the calculation of ineligibility duration	2024-08-03 09:57:41 +09:00
Andrea Righi	3ad2875240	Merge pull request #463 from sched-ext/bpfland-update-dsq-vtime scx_bpfland: always re-align task's vruntime to the global vruntime	2024-08-02 22:13:12 +02:00
Daniel Hodges	1f922b9a73	scx_layered: Add support for disabling topology awareness Add a parameter to disable topology awareness. This is useful when trying to compare the scheduling performance of topology aware scheduling compared to the previous scheduling strategy. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-08-02 08:07:19 -07:00
Changwoo Min	f3fd6e9cb3	scx_lavd: drop 2-level-scheduling With optimizations of calculating ineligibility duration, now the scheduler works well under heavy load without 2-level scheduling, so we drop it for simplicity. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-02 21:46:07 +09:00
Changwoo Min	c38e749c36	scx_lavd: improve the equation for calculating ineligibility duration This commit includes a few changes: - treat a newly forked task more conservatively - defer the execution of more tasks for a longer time using ineligibility duration - consider if a task is woken up in calculating ineligibility duration	2024-08-02 21:08:29 +09:00
Andrea Righi	bee0d699ef	scx_bpfland: always re-align task's vruntime to the global vruntime Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice lag) as soon as we are evaluating the task's vruntime. This allows the task to rapidly chase the minimum global vruntime, ensuring that tasks with a predominantly sleeping behavior pattern are not over-prioritized. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-08-02 13:11:25 +02:00
Changwoo Min	5e194330f0	scx_lavd: consider task's wakeup and vruntime (starvation) more aggressively Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-08-02 12:25:29 +09:00
Daniel Hodges	de7b5fe190	scx_layered: Fix dispatch fallback CPU selection When the previous CPU for a task is not known do not fall back to dispatching to CPU 0, use the current CPU. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-31 12:35:22 -07:00
Changwoo Min	fc0ffeb45b	scx_lavd: print the overall status of a scheduled task L or R: Latency-critical, Regular H or I: performance-Hungry, performance-Insensitive B or T: Big, liTtle E or G: Eligible, Greedy P or N: Preemption, Not Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 19:00:35 +09:00
Changwoo Min	22d4b13e8e	scx_lavd: classify CPUs into BIG and little ones based on their average capacity Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 19:00:35 +09:00
Changwoo Min	0ad2f30fa8	Merge pull request #460 from multics69/lavd-misc scx_lavd: misc updates	2024-07-31 08:55:04 +09:00
Daniel Hodges	c224154866	Merge pull request #459 from hodgesds/layer-cpu-counter scx_layered: Add per cpu layer iterator offset	2024-07-30 16:00:37 -04:00
Daniel Hodges	4f12bebaa5	scx_layered: Add per cpu layer iterator offset Add a per cpu counter offset to round robin when iterating on layers. This is to make selection from different layers more fair. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-30 10:44:41 -07:00
Changwoo Min	9b455cf010	Merge pull request #458 from sched-ext/lavd-fix-cpu-ctx-size scx_lavd: set correct size for cpu_ctx_stor	2024-07-31 00:39:13 +09:00
Changwoo Min	6136cbee65	scx_lavd: tuning the time slice and preemption margins Tune the time slice under high load and make the kick/tick margins for preemption more conservative. In particular, aggressive IPI-based preemption (kick) causes performance instability. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:30:59 +09:00
Changwoo Min	35b0d9f3c2	scx_lavd: improve starvation factor equation Instead of using coarse-grained log(), let's directly use the ratio of task's service time. The virtual deadline equation is also updated to reflect this change. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:27:17 +09:00
Changwoo Min	f9657a549f	scx_lavd: fix bpf verification error in old kernel versions Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:22:43 +09:00
Changwoo Min	d2615b4975	scx_lavd: fix warnings from the rust code Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:21:32 +09:00
Andrea Righi	2015faa745	scx_lavd: set correct size for cpu_ctx_stor The max_entries parameter in BPF_MAP_TYPE_PERCPU_ARRAY defines the number of values per CPU and for cpu_ctx_stor we only need one item: the CPU context. Set max_entries to 1 to avoid allocating unnecessary memory and slightly reduce the memory footprint. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-30 09:32:55 +02:00
Changwoo Min	643edb5431	Merge pull request #457 from multics69/lavd-amp-v2 scx_lavd: support two-level scheduling for heavy-loaded cases (like bpfland)	2024-07-30 10:39:06 +09:00
Changwoo Min	b91c1e4759	scx_lavd: add more comments on no_2_level_scheduling implementation Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-29 12:22:28 +09:00
Changwoo Min	f71fff9bbe	scx_lavd: print a warning message when system does not provide a proper freq info Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:53:02 +09:00
Changwoo Min	4449d8e31c	scx_lavd: incorporate a task's static priority in calculating its latency criticality That's because static (nice) priority is a strong hint to distinguish latency-critical tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:41:43 +09:00
Changwoo Min	221f1fe12a	scx_lavd: further prioritize producers over consumers That is because many latency-critical tasks are producers. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:38:54 +09:00
Changwoo Min	7106e8cdca	scx_lavd: support two-level scheduling for heavy-loaded cases We introduce two-level scheduling similar to scx_bpfland. The two-level scheduling consists of two DSQs: 1) latency-critical run queue and 2) regular run queue. The scheduler prioritizes scheduling tasks on the latency-critical queue but makes its best effort to schedule tasks on the regular queue. The scheduler could be more resilient under heavy load by segregating regular, non-latency-critical tasks from latency-critical tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:33:17 +09:00
Changwoo Min	9236c3e57c	scx_lavd: increase the targeted latency for heavy loaded cases Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:30:01 +09:00
Changwoo Min	230512208d	scx_lavd: fix div by zero error in some installations The max frequency information from topology (from sysfs) does not always seem to be accurate. In some installations, it returns zero for all CPUs. In this case, let's just assume all CPUs have the same capacity (1024), hoping the kernel can give more precise information. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 12:47:00 +09:00
Changwoo Min	59e54f4972	scx_lavd: print how to disable logging Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 12:31:51 +09:00
Changwoo Min	df1108ec6c	scx_lavd: segregate starvation factor from the latency criticality (refactoring) Latency criticality is a task's inherent property, but the starvation factor is its dynamic status for the urgency of scheduling. Hence, we segregate the starvation factor out. Also, cleaned up unnecessary arguments and struct fields related. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-27 17:25:39 +09:00
Changwoo Min	d4a5a629ff	Merge pull request #452 from multics69/lavd-core-compaction-v2 lavd_lavd: initial support for AMP (asymmetric multi-processor) architecture	2024-07-27 16:22:27 +09:00
Changwoo Min	eeea847697	scx_lavd: adjust time slice based on CPU's capacity When a task is running on a more performant core, the scheduler will give it a longer time slice. On the other hand, on a less performant core, a shorter time slice will be assigned. The longer time slice helps boost the clock frequency on a performant core. Also, the shorter time slice gives the performant core more chances of being utilized. Regarding the CPU capacity, we first check if kernel-provided capacity values are trustworthy or not. If not (i.e., all the same values), we rely on the user-provided value, based on each CPU's maximum clock frequency. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	e7b6ed1838	scx_lavd: add --prefer-smt-core option When the --prefer-smt-core option is on, the core compaction prefers to utilize the hyper-twin first before utilizing the other physical CPUs. By default, the option is off. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	19e337cd9b	scx_lavd: make the core compaction AMP-aware Previously, the core compaction assumed that each core's capacity was the same. Now, we additionally consider each core's max clock frequency. So, it always tries to use the higher-frequency cores first. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	dbb3957eb1	scx_lavd: add a missing no_freq_scaling option check Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	90b57a3fd7	scx_lavd: put a pinned kernel task to an overflow set Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	e76bf999df	scx_lavd: clean up constants (no functional changes) Remove unused constants and rename outdated constants to proper names (LAVD_TC_* to LAVC_CC_* and LAVD_ELIGIBLE_DSQ to LAVD_GLOBAL_DSQ). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Andrea Righi	19854f1535	scx_bpfland: allow to specify negative values with --slice-us-lag Using negative values with --slice-us-lag can be useful to make performance more consistent and prioritize newly created tasks over the running tasks. Therefore, allow to specify negative values from the command line and also update the documentation of this option. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-26 09:10:18 +02:00
David Vernet	5401876430	Revert "rusty: Rework deadline as a signed sum"	2024-07-25 14:50:45 -05:00
David Vernet	09536aa15d	Merge pull request #309 from sched-ext/rusty_improved_dl rusty: Rework deadline as a signed sum	2024-07-25 13:44:54 -05:00
David Vernet	c1ad602ce5	rusty: Transfer latency priority between CPU-intensive and interactive tasks In some scenarios, a CPU-intensive task may be on the critical path for interactive workloads. For example, you may have a game with CPU-intensive tasks that are crunching the logic for the game, and that's required for the game to proceed without being choppy. To support such workflows, this change adds logic to allow a non-interactive task to inherit the lower (i.e. stronger) latency priority of another task if it wakes or is woken by that task. Signed-off-by: David Vernet <void@manifault.com>	2024-07-25 11:55:40 -05:00
David Vernet	933ea9baa1	rusty: Rework deadline as a signed sum Currently, a task's deadline is computed as its vtime + a scaled function of its average runtime (with its deadline being scaled down if it's more interactive). This makes sense intuitively, as we do want an interactive task to have an earlier deadline, but it also has some flaws. For one thing, we're currently ignoring duty cycle when determining a task's deadline. This has a few implications. Firstly, because we reward tasks with higher waker and blocked frequencies due to considering them to be part of a work chain, we implicitly penalize tasks that rarely ever use the CPU because those frequencies are low. While those tasks are likely not part of a work chain, they also should get an interactivity boost just by pure virtue of not using the CPU very often. This should in theory be addressed by vruntime, but because we cap the amount of vtime that a task can accumulate to one slice, it may not be adequately reflected after a task runs for the first time. Another problem is that we're minimizing a task's deadline if it's interactive, but we're also not really penalizing a task that's a super CPU hog by increasing its deadline. We sort of do a bit by applying a higher niceness which gives it a higher deadline for a lower weight, but its somewhat minimal considering that we're using niceness, and that the best an interactive task can do is minimize its deadline to near zero relative to its vtime. What we really want to do is "negatively" scale an interactive task's deadline with the same magnitude as we "positively" scale a CPU-hogging task's deadline. To do this, we make two major changes to how we compute deadline: 1. Instead of using niceness, we now instead use our own straightforward scaling factor. This was chosen arbitrarily to be a scaling by 1000, but we can and should improve this in the future. 2. 
We now create a _signed_ linear latency priority factor as a sum of the three following inputs: - Work-chain factor (log_2 of product of blocked freq and waker freq) - Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle -- higher duty cycle means lower factor) - Average runtime factor (Higher avg runtime means higher average runtime factor) We then compute the latency priority as: lat_prio := Average runtime factor - (work-chain factor + duty cycle factor) This gives us a signed value that can be negative. With this, we can compute a non-negative weight value by calculating a weight from the absolute value of lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive, it's the task's vtime PLUS scaled slice_ns. This ends up working well because you get a higher weight both for highly interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets you scale a task's deadline "more negatively" for interactive tasks, and "more positively" for the CPU hogs. With this change, we get a significant improvement in FPS. On a 7950X, if I run the following workload: $ stress-ng -c $((8 * $(nproc))) 1. I get 60 FPS when playing Stellaris (while time is progressing at max speed), whereas EEVDF gets 6-7 FPS. 2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. The Civ6 benchmark doesn't even start after over 4 minutes in the initial frame with EEVDF, but gets us 13s / turn with rusty. 3. It seems that EEVDF has improved with Terraria in v6.9. It was able to maintain ~30-55 FPS, as opposed to the ~5-10FPS we've seen in the past. rusty is still able to maintain a solid 60-62FPS consistently with no problem, however.	2024-07-25 11:55:03 -05:00
Daniel Hodges	4c3fd6cd9b	scx_layered: Rename UserId and GroupId TLDR; rename UserId and GroupId to UIDEquals and GIDEquals. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 15:09:08 -07:00
Daniel Hodges	55f6d68eef	scx_layered: Add user and group layers Add a layer match based on either the effective user id or the effective group id. This allows for creating layers for individual users or groups. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 15:09:08 -07:00
Daniel Hodges	4042fc42d7	Merge pull request #446 from hodgesds/layered-topo scx_layered: Add topology awareness for NUMA nodes and LLCs	2024-07-24 18:06:43 -04:00
Daniel Hodges	2803f9c127	scx_layered: Fix formatting issues Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 14:39:02 -07:00
Daniel Hodges	0814abf0b8	scx_layered: Add node topology awareness Add NUMA node topology awareness for scx_layered. This borrows some of the NUMA handling from scx_rusty and allows layers to set a node mask. Different layer kinds will use the node mask differently. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 09:53:48 -07:00
Daniel Müller	98af514972	scx_rusty: Simplify LoadBalancer::populate_tasks_by_load() Simplify LoadBalancer::populate_tasks_by_load() by cutting out the heap allocation bits, by moving mutable accesses in front of immutable ones. Because multiple immutable accesses (between bss and rodata) do not conflict, we don't need the intermediate PID storage. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-23 13:59:26 -07:00
Andrea Righi	46ddca6bd5	scx_bpfland: report task time slice to stdout Periodically report to stdout samples of the effective time slice applied to tasks. While one could determine this metric by examining the max slice_ns and nr_waiting metrics, directly reporting it to stdout allows users to quickly identify what is happening and it provides a clearer overview of the scheduling behavior. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	c1d93d2a00	scx_bpfland: drop kthread dispatches metric Dispatching per-CPU kthreads directly is disabled by default, reporting this metric can generate some confusion (since it is always 0), and even if local kthread dispatches are enabled, they should be still considered as regular direct dispatches (there is no difference in practice). Therefore, merge direct kthread dispatches into direct dispatches and drop the separate nr_kthread_dispatches metric. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	a5f1d6b595	scx_bpfland: show average amount of tasks waiting to be dispatched Periodically report the average amount of tasks sitting in the priority and shared DSQs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:45 +02:00
Andrea Righi	5908a985bc	scx_bpfland: adjust task time slice based on the amount of waiting tasks Scale the task's time slice based on the average amount of tasks that are currently waiting to be dispatched. Use a moving average for the amount of waiting tasks to smooth out potential spikes caused by temporary bursts of tasks piling in the wait queues. This was initially modeled in scx_rustland and it seems to work pretty well also in scx_bpfland now. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 21:53:25 +02:00
Changwoo Min	af75d147c8	Merge pull request #443 from multics69/lavd-vtime scx_lavd: overhaul the virtual deadline algorithm	2024-07-21 18:00:57 +09:00
Changwoo Min	a9aab6b229	scx_lavd: fix typo Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-21 17:58:44 +09:00
Changwoo Min	add96f0e18	scx_lavd: do not maintain ineligible runnable tasks separately With all the other optimizations and tunings, it turns out that maintaining two runqueues has more harm than good. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 17:49:12 +09:00
Changwoo Min	827187d213	scx_lavd: adjust ineligible duration according to task's lat_cri Further depenalize above-average latency-critical tasks and further penalize below-average latency-critical tasks in ineligibility duration. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 17:37:27 +09:00
Changwoo Min	c653622ed9	scx_lavd: add LAVD_VDL_LOOSENESS_FT in calculating virtual deadline LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller value means the deadline is tighter. While it is unlikely to be tuned, let's keep it as a tunable for now. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 12:00:50 +09:00
Changwoo Min	e94070d5ca	scx_lavd: remove LAVD_BOOST_* These are no longer necessary after directly using latency criticality. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 11:53:20 +09:00
Changwoo Min	43f0fcb87c	scx_lavd: removed unused LAVD_LOAD_FACTOR_* These are no longer necessary after removing load factor calculation. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 11:51:12 +09:00
David Vernet	4f11e2abe2	layered: Don't dispatch to LO_FALLBACK_DSQ Non-kthreads with custom affinities in non-open layers are dispatched into a LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom affinities. When a host is fully utilized, these tasks can end up being starved due to LO_FALLBACK_DSQ being consumed only when there are no other layers to consume from. In internal workloads at Meta, we've observed that this can happen in practice. Longer term, we can probably address this by implementing layer weights and applying that to fallback DSQs to avoid starvation. For now, let's just dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue. Signed-off-by: David Vernet <void@manifault.com>	2024-07-19 19:14:18 -05:00
Changwoo Min	3924ebaa4d	scx_lavd: properly synchronize taskc->vdeadline_log_clk Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 01:41:29 +09:00
Changwoo Min	02ad43d116	scx_lavd: directly use p->scx.weight instead of load_ideal Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 00:25:11 +09:00
Changwoo Min	c955caefd8	scx_lavd: drop sys_load_factor In theory, sys_load_factor should not be necessary since we do not stretch the time space anymore. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 00:10:29 +09:00
Changwoo Min	67a6deb983	scx_lavd: use lat_cri instead of lat_prio universally Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 23:56:51 +09:00
Daniel Hodges	b98a9f56a8	scx_layered: Add separate module for metrics Refactor the main module for scx_layered to move metrics into a separate module. This change makes no functional differences; it only changes code structure. This will make it a little easier to navigate the logic in the main scheduler code. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-19 07:40:24 -07:00
Changwoo Min	6f10d6907c	scx_lavd: drop sched_prio_to_slice_weight[] table Use p->scx.weight instead. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 22:39:01 +09:00
Changwoo Min	034303f00f	scx_lavd: consider starvation factor in determining latency criticality Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 22:17:50 +09:00
Daniel Hodges	d974690b5d	Merge pull request #435 from vax-r/remove_skip_while scx_rusty: Remove skip_while in find_first_candidate	2024-07-19 08:38:58 -04:00
Changwoo Min	99e0d21c3c	scx_lavd: drop the runtime factor in calculating latency criticality That is okay since the runtime is considered in calculating a virtual deadline. A shorter runtime will result in a tighter deadline linearly. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 17:28:40 +09:00
Changwoo Min	b90599e967	scx_lavd: do not inherit parent's properties If inheriting the parent's properties, a newly forked task tends to be too prioritized. That is, many parent processes, such as `make`, are a bit more latency-critical than average. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 15:29:13 +09:00
Andrea Righi	c4eb3ce7b4	scx_bpfland: introduce dynamic nvcsw threshold Instead of using a static value to classify tasks based on their average amount of voluntary context switches, try to periodically evaluate an optimal threshold, based on a global average of voluntary context switches among all the running tasks. Tasks with an average amount of voluntary context switches greater than the global average will be classified as interactive. The global average is evaluated as an exponentially weighted moving average (EWMA), as: avg(t) = avg(t - 1) * 0.75 + task_avg(t) * 0.25 This approach is more efficient than iterating through all tasks and it helps to prevent rapid fluctuations that may be caused by bursts of voluntary context switch events. The dynamic nvcsw threshold enables a more precise adjustment of the classification criteria to swiftly respond to global system changes: tasks can be quickly classified as interactive, but if the system experiences too many interactive events, the criteria for maintaining interactive status become stricter. This creates a natural selection process where only the most deserving tasks remain interactive. Additionally, introduce the new option `--nvcsw-max-thresh N`, which allows to extend or restrict the fluctuation range of the global average threshold for voluntary context switches. Tested-by: Piotr Gorski <piotrgorski@cachyos.org> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-18 19:03:25 +02:00
Changwoo Min	78d96a6fb6	scx_lavd: advance clock by reverse proportional to the system load Advancing the clock slower when overloaded gives more opportunities for latency-critical tasks to cut in the run queue. Controlling the clock better reflects the actual load than the prior approach of stretching the time-space when overloaded. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-18 15:53:38 +09:00
Changwoo Min	9bc20f9160	scx_lavd: maintain ineligible runnable tasks separately We now maintain two run queues—an eligible run queue (DSQ) and an ineligible run queue (rbtree)—sorted by the task's virtual deadline. When the eligible run queue is empty, or the ineligible run queue has not been consumed for too long (e.g., 15 msec), a task in the ineligible run queue is moved to the eligible run queue for execution. With these two queues, we have a better admission control. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 23:46:11 +09:00
I Hsin Cheng	2525b94af4	scx_rusty: Remove unused variable Remove unused variable "has_preferred_dom". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-17 20:30:17 +08:00
I Hsin Cheng	bf2f0fbf35	scx_rusty: Remove skip_while in find_first_candidate Followed commit `1c3b563`, move the checking of task.migrated.get() into the vector filter. In this way, we can remove the skip_while() call in find_first_candidate(). Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-17 20:27:12 +08:00
Changwoo Min	55e19ea5df	scx_lavd: do not prioritize a wake-up task in ops.select_cpu() This is a prep for adding an ineligible DSQ. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 11:16:02 +09:00
Changwoo Min	c84b73e971	scx_lavd: rename LAVD_GLOBAL_DSQ to LAVD_ELIGIBLE_DSQ This is a prep to add a global ineligible dsq. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 10:34:34 +09:00
Daniel Müller	565aec3662	rust: Update libbpf-rs & libbpf-cargo to 0.24 Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated skeletons now contain directly accessible map and program objects, no longer necessitating the use of accessor methods. As a result, the risk for mutability conflicts is reduced greatly. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-16 11:48:52 -07:00
Daniel Hodges	27122a8a00	scx_rusty: refactor mempolicy handling bpf code and load balancing This change refactors some of the helper methods for getting the preferred node for tasks using mempolicy. The load balancing logic in try_find_move_task is updated to allow for a filter, which is used to filter for tasks with a preferred mempolicy. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 09:40:00 -07:00
Daniel Hodges	43a263aa75	scx_rusty: Use preferred node mask with balancer Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 08:11:19 -07:00
Daniel Hodges	bab6e9523c	scx_rusty: Add mempolicy checks to rusty This change makes scx_rusty mempolicy aware. When a process uses set_mempolicy it can change NUMA memory preferences and cause performance issues when tasks are scheduled on remote NUMA nodes. This change modifies task_pick_domain to use the new helper method that returns the preferred node id. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 08:11:19 -07:00
Changwoo Min	971bb2e024	scx_lavd: pretty formatting for ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:54:15 +09:00
Changwoo Min	adfbf3934c	scx_lavd: tuning the max ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:52:23 +09:00
Changwoo Min	eff444516f	scx_lavd: directly measure service time for eligibility enforcement Estimating the service time from run time and frequency is not incorrect. However, it reacts slowly to sudden changes since it relies on the moving average. Hence, we directly measure the service time to enforce fairness. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:48:26 +09:00
I Hsin Cheng	1c3b563caf	scx_rusty: Pre-check task domain mask with pull domain mask Instead of performing domain mask checking inside "find_first_candidate()" every time, check whether the tasks within the push domain are able to run on the pull domain by performing the mask check at the vector generation stage. This can also avoid repeated computation generated by the same (task, pull_dom) pair as they'll try to check whether the pull domain is in the task domain mask. Also, since whether a task is a kworker won't change over time, we can perform the check earlier and put it in the filter, too. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-16 21:48:06 +08:00
Tejun Heo	51334b5c4d	Bump versions for 1.0.1 release	2024-07-15 13:21:52 -10:00
Andrea Righi	8e7a526356	scx_bpfland: use nr_cpu_ids for consistency We always use nr_cpu_ids to represent the maximum CPU id returned by scx_bpf_nr_cpu_ids(). Replace cpu_max with nr_cpu_ids to be more consistent with the rest of the code. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 08:44:35 +02:00
Andrea Righi	33d06f653b	scx_bpfland: get rid of the MAX_CPUS hard-coded limit We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU DSQs, eliminating the need for the hard-coded limit MAX_CPUS. In this way scx_bpfland can support the same amount of CPUs that the kernel can handle. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:30 +02:00
Andrea Righi	b80ef7d8eb	scx_bpfland: optimize offline CPU handling Instead of constantly checking the need to drain tasks from the DSQs of the offline CPUs, provide an atomic flag to notify when there are tasks to be drained from the offline CPUs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:23 +02:00
Andrea Righi	0530706710	scx_bpfland: properly initialize the nvcsw metrics Initialize the number of voluntary context switches metrics in the local task storage. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:16:10 +02:00
Andrea Righi	bf4ad23599	scx_bpfland: refine interactive tasks flood safeguard Refine the safeguard mechanism to avoid generating too many interactive tasks in the system, which could nullify the effect of the interactive/regular task classification. The safeguard mechanism operates by pausing the promotion of new tasks to interactive status during the task wake-up process, whenever the number of interactive tasks in the priority queue exceeds a specific limit (set to 4x the number of online CPUs). Halting the promotion of additional interactive tasks allows to prioritize those already classified as interactive, thereby preventing potential "bursts" of excessive interactive tasks in the system. This refines the mitigation already provided by commit `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost"). Fixes: `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost") Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:11:34 +02:00
Andrea Righi	eb1cf0e670	scx_bpfland: improve task time slice evaluation Always assign the maximum time slice if there are idle CPUs in the system. Otherwise, double the task's unused time slice to reward tasks that use less CPU time and at the same time refill the time slice of the tasks every time they're dispatched. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-14 23:24:24 +02:00
Tejun Heo	3ae76acd12	Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0 Sync to upstream kernel and bump to 1.0	2024-07-14 07:00:38 -10:00
Changwoo Min	5b2112dd81	Merge pull request #421 from multics69/lavd-metrics scx_lavd: improve time slice and waker freq calculation	2024-07-14 18:49:36 +09:00
Tejun Heo	761ec142ce	Bump most versions to 1.0.0 sched_ext is about to be merged upstream. There are some compatibility breaking changes and we're making the current sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") the baseline. Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early development and is currently temporarily disabled, only the patchlevel is bumped.	2024-07-12 11:34:14 -10:00
Tejun Heo	54c487731a	Update to vmlinux-v6.10-rc2-g1edab907b57d.h Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely will be the commit which will be merged during the upcoming kernel v6.11 merge window. Unfortunately, this is a compatibility breaking change. As the size of bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and scx_layered - won't be able to run on older kernels. Likewise, older binaries from before this commit won't be able to run on newer kernels.	2024-07-12 11:13:34 -10:00
Tejun Heo	f261d0f037	Sync from kernel - 1edab907b57d Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11 - cgroup support hasn't landed in the upstream kernel yet. This most likely will happen in a few weeks. For the time being, disable scx_flatcg, scx_pair and scx_mitosis. - Compat macro for DSQ task iterator dropped. This is now a part of the baseline. - scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being discussed. Dropped example usage from tools/sched_ext. None of the practical schedulers use it, so this should be fine for now. - scx_bpf_cpu_rq() added. - AUTOATTACH workaround for newer libbpf versions added.	2024-07-12 11:08:41 -10:00
Changwoo Min	512bd143a5	scx_lavd: count only related tasks in calculating waker_freq A task can become runnable in any task's context, not only in its waker task's context. Thus, we should not count wake-ups that happen in an unrelated task's context. With this commit, the scheduler can (much more) accurately detect waker-wakee relationships. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:51:09 +09:00
Changwoo Min	95733f63ab	scx_lavd: calculate time slice as a function of run queue length The prior approach using the sum of weights gives too much penalty to nice tasks with large nice values. With this commit, the time slice is determined by the number of runnable tasks regardless of nice priority. Note that the fairness will still be enforced based on tasks' nice priorities (weights). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:45:22 +09:00
Changwoo Min	00fdc1d949	Merge pull request #417 from multics69/lavd-vdeadline scx_lavd: improve virtual deadline and current clock handling	2024-07-12 14:05:44 +09:00
Changwoo Min	d4bc92bea7	scx_lavd: print lat_cri to output Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 13:23:56 +09:00
Changwoo Min	4c5c564523	scx_lavd: initial current logical clock to zero To easily distinguish, let's initialize the current logical clock to zero (not the current physical time). Also, avoid the deadline calculation being zero by adding +1 here and there. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 10:15:54 +09:00
Andrea Righi	640bd562ff	scx_bpfland: prevent tasks from abusing interactive priority boost The priority boost for interactive tasks can be exploited to render the system nearly unresponsive by creating numerous tasks that constantly switch between wait/wakeup states. For example, stress tests like `hackbench -l 10000` can significantly degrade system responsiveness. To mitigate this, limit the number of interactive tasks added to the priority queue to 4x the number of online CPUs. This simple approach appears to be quite effective at identifying potential spam of "fake" interactive tasks, while still prioritizing legitimate interactive tasks. Additionally, periodically refresh the interactive status of the tasks based on their most recent average of voluntary context switches, preventing the interactive status from being too "sticky". Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:55 +02:00
Andrea Righi	1babb2b92d	scx_bpfland: prevent per-CPU kthreads starving other tasks Avoid dispatching per-CPU kthreads directly, since this may cause interactivity problems or unfairness, for example if there are too many softirqs being scheduled (e.g., in presence of high RX network traffic or when running certain stress tests, like hackbench). Moreover, in order to help with testing and benchmarks, introduce the option --local-kthread, which restores the old behavior when enabled. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:48 +02:00
Andrea Righi	c3ebdd338f	scx_bpfland: prevent slice delta overflow When updating the task vruntime, ensure the time slice delta is always a positive value. Failing to do so may cause the global vruntime to increase excessively due to overflows. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	f59aa52fe7	scx_bpfland: expose the amount of online CPUs Periodically report the amount of online CPUs to stdout. The online CPUs are initially evaluated looking at the online cpumask, then the value is updated in the .cpu_offline() / .cpu_online() callbacks. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	3a47b484af	scx_bpfland: report interactive tasks to stdout Keep track of the CPUs that are running interactive tasks and report their amount to stdout. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	1a1a16b9e9	scx_bpfland: fix typo in slice_ns definition The correct default value of slice_ns is 5ms, not 5s. This change doesn't really make any difference in practice, since these values are changed by the Rust part when the scheduler is started, but it's good to keep this aligned to the proper values for consistency. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Changwoo Min	bdbfeb9fd1	scx_lavd: use logical current clock for virtual deadlines This commit changes the use of a physical clock to a virtual, logical clock in calculating deadlines. - The virtual current clock advances upon a task's running to its virtual deadline. - When enqueuing a task, its virtual deadline from the virtual current clock is calculated. With the above two changes, this guarantees that there is no such task whose virtual deadline is smaller than the virtual current clock. This means any enqueuing task can compete with any other already enqueued tasks. This allows a latency-critical task to be immediately scheduled if needed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 22:41:56 +09:00
Changwoo Min	408ea7892c	scx_lavd: induce sched_prio_to_latency_weight from slice weight So sched_prio_to_latency_weight is removed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:37:21 +09:00
Changwoo Min	bd964acff6	scx_lavd: deprioritize a newly forked task in latency Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:36:32 +09:00
Changwoo Min	48debe416e	scx_lavd: tuning the deadline equation under high load Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:54 +09:00
Changwoo Min	c72e063680	scx_lavd: do not use lat_prio_to_greedy_thresholds With other optimizations, lat_prio_to_greedy_thresholds is not effective any more. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:01 +09:00
Changwoo Min	9ed488798e	scx_lavd: use task's runtime to determine its deadline It has the effect of further preferring shorter jobs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:34:25 +09:00
Changwoo Min	e081b2a294	scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:33:56 +09:00
Andrea Righi	995577762a	scx_bpfland: refill task time slice Every time we need to dispatch a task, re-evaluate its time slice as: (unused_time_slice + min_time_slice) / 2. This refills the time slice for tasks that haven't used much of their previously assigned time, improving fairness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	6a64182ef2	scx_bpfland: always classify interactive tasks Make sure to always classify interactive tasks, even when the system is not fully utilized. This ensures that if the system suddenly becomes overloaded, we already know which tasks need to be dispatched to the priority DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	8dd528abfd	scx_bpfland: pass enqueue flags when dispatching kthreads Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:10 +02:00
Andrea Righi	fc0d1bd003	Merge pull request #415 from sched-ext/bpfland-output scx_bpfland: additional stats and output improvements	2024-07-05 19:50:07 +02:00
Tejun Heo	af5e89e73c	Merge pull request #412 from vax-r/flatcg_delta_fetch scx_flatcg: Make good use of __sync_fetch_and_sub()	2024-07-05 07:39:22 -10:00
Tejun Heo	14d0a0ef64	Merge pull request #411 from vax-r/Fix_typo scx_flatcg: Fix_typo	2024-07-05 07:35:31 -10:00
Andrea Righi	2bc8f800e7	scx_bpfland: report build id version Use the version string provided by scx_utils:build_id. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	bdb31e98e2	scx_bpfland: show statistics in a more human-readable format Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	f9d7844b2e	scx_bpfland: split direct dispatches and kthread dispatches Show separate statistics for direct dispatches and kthread direct dispatches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:27:59 +02:00
I Hsin Cheng	aae826b1b3	scx_flatcg: Make good use of __sync_fetch_and_sub() Fetch the value of "delta" directly from the return value of __sync_fetch_and_sub(), as it returns the original value of cgc->cvtime_delta. An additional fetch of cgc->cvtime_delta would be redundant here. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-05 01:03:20 +08:00
I Hsin Cheng	3e52761487	scx_flatcg: Fix_typo Fix "oppotunistic" to "opportunistic". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-04 22:04:40 +08:00
Andrea Righi	cfe2ed063d	scx_bpfland: time-based starvation prevention Tasks are consumed from various DSQs in the following order: per-CPU DSQs => priority DSQ => shared DSQ Tasks in the shared DSQ may be starved by those in the priority DSQ, which in turn may be starved by tasks dispatched to any per-CPU DSQ. To mitigate this, record the timestamp of the last task scheduling event both from the priority DSQ and the shared DSQ. If the starvation threshold is exceeded without consuming a task, the scheduler will be forced to consume a task from the corresponding DSQ. The starvation threshold can be adjusted using the --starvation-thresh command line parameter (default is 5ms). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:52:39 +02:00
Andrea Righi	9e0db4ae17	scx_bpfland: remove unnecessary RCU read protection There is no need to RCU protect the cpumask for the offline CPUs: it is created once when the scheduler is initialized and it's never deallocated. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	cef6ca93cf	scx_bpfland: adjust default time slice to 5ms Reduce the default time slice down to 5ms for a faster reaction and system responsiveness when the system is overcommissioned. This also helps to provide a more predictable level of performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	7d15e3171c	scx_bpfland: ensure task time slice never exceeds the slice_ns limit Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	e8a4d350ad	scx_bpfland: unify dispatching kthreads with direct CPU dispatches Always use direct CPU dispatch for kthreads, there is no need to treat kthreads in a special way, simply reuse direct CPU dispatch to prioritize them. Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(), since we may dispatch multiple tasks to the same per-CPU DSQ now. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 09:38:43 +02:00
Andrea Righi	d2231b0aed	scx_bpfland: drop built-in idle CPU selection logic Small refactoring of the idle CPU selection logic: - optimize idle CPU selection for tasks that can run on a single CPU - drop the built-in idle selection policy and completely rely on the custom one Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 08:54:17 +02:00
Andrea Righi	7c355f50b2	scx_bpfland: use the right cpumask to find any idle CPU We are incorrectly using the SMT idle cpumask to find any idle CPU, fix by using the generic idle cpumask. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-01 20:36:24 +02:00
Andrea Righi	c458f345b4	Merge pull request #408 from sched-ext/bpfland-cpu-hotplug scx_bpfland: support CPU hotplugging	2024-07-01 19:41:00 +02:00
Dan Schatzberg	32ac4b2cff	Merge pull request #389 from dschatzberg/mitosis mitosis: Update synchronization	2024-07-01 09:44:26 -04:00
Andrea Righi	ff7a518d28	scx_bpfland: support CPU hotplugging Implement CPU hotplugging in scx_bpfland without restarting the scheduler. The idle selection logic has been updated to consider online CPUs. Additionally, a cpumask for offline CPUs has been introduced. Tasks that have been dispatched to the DSQs associated with offline CPUs are consumed by the other CPUs that are still online. Moreover, the dependency on the Topology crate is temporarily dropped and instead, /sys/devices/system/cpu/smt/active is used to determine if SMT should be taken into account during idle selection. The Topology crate will be re-introduced later when scx_bpfland will gain more topology-aware capabilities. This fixes #406. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 23:04:13 +02:00
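
The time slice refill described in commit 995577762a, combined with the upper bound from commit 7d15e3171c, can be sketched roughly as follows. This is only an illustration of the arithmetic stated in those commit messages, not the scx_bpfland source: the function name `refill_slice()` and its parameter names are hypothetical, and the assumption that the clamp is applied after the averaging step is mine.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from scx_bpfland): re-evaluate a task's
 * time slice at dispatch time as the average of its unused slice and the
 * minimum slice, then cap the result at the slice_ns limit.
 *
 * All values are in nanoseconds.
 */
static uint64_t refill_slice(uint64_t unused_slice_ns,
                             uint64_t min_slice_ns,
                             uint64_t slice_ns)
{
    /* (unused_time_slice + min_time_slice) / 2, as in commit 995577762a */
    uint64_t refilled = (unused_slice_ns + min_slice_ns) / 2;

    /* Never exceed the slice_ns limit, as in commit 7d15e3171c */
    if (refilled > slice_ns)
        refilled = slice_ns;

    return refilled;
}
```

With the 5ms default slice from commit cef6ca93cf and an assumed (illustrative) minimum slice of 500us, a task that used its whole slice would get 250us, while a task with 4ms left over would get 2.25ms. Note that commit eb1cf0e670 later replaced this rule with a different evaluation.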


1185 Commits