JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-12-01 21:37:12 +00:00

Author	SHA1	Message	Date
Daniel Hodges	c224154866	Merge pull request #459 from hodgesds/layer-cpu-counter scx_layered: Add per cpu layer iterator offset	2024-07-30 16:00:37 -04:00
Daniel Hodges	4f12bebaa5	scx_layered: Add per cpu layer iterator offset Add a per cpu counter offset to round robin when iterating on layers. This is to make selection from different layers more fair. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-30 10:44:41 -07:00
Changwoo Min	9b455cf010	Merge pull request #458 from sched-ext/lavd-fix-cpu-ctx-size scx_lavd: set correct size for cpu_ctx_stor	2024-07-31 00:39:13 +09:00
Changwoo Min	6136cbee65	scx_lavd: tuning the time slice and preemption margins Tune the time slice under high load and make the kick/tick margins for preemption more conservative. In particular, aggressive IPI-based preemption (kick) causes performance instability. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:30:59 +09:00
Changwoo Min	35b0d9f3c2	scx_lavd: improve starvation factor equation Instead of using coarse-grained log(), let's directly use the ratio of task's service time. The virtual deadline equation is also updated to reflect this change. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:27:17 +09:00
Changwoo Min	f9657a549f	scx_lavd: fix bpf verification error in old kernel versions Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:22:43 +09:00
Changwoo Min	d2615b4975	scx_lavd: fix warnings from the rust code Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-31 00:21:32 +09:00
Andrea Righi	2015faa745	scx_lavd: set correct size for cpu_ctx_stor The max_entries parameter in BPF_MAP_TYPE_PERCPU_ARRAY defines the number of values per CPU and for cpu_ctx_stor we only need one item: the CPU context. Set max_entries to 1 to avoid allocating unnecessary memory and slightly reduce the memory footprint. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-30 09:32:55 +02:00
Changwoo Min	643edb5431	Merge pull request #457 from multics69/lavd-amp-v2 scx_lavd: support two-level scheduling for heavy-loaded cases (like bpfland)	2024-07-30 10:39:06 +09:00
Changwoo Min	b91c1e4759	scx_lavd: add more comments on no_2_level_scheduling implementation Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-29 12:22:28 +09:00
Changwoo Min	f71fff9bbe	scx_lavd: print a warning message when system does not provide a proper freq info Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:53:02 +09:00
Changwoo Min	4449d8e31c	scx_lavd: incorporate a task's static priority in calculating its latency criticality That's because static (nice) priority is a strong hint to distinguish latency-critical tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:41:43 +09:00
Changwoo Min	221f1fe12a	scx_lavd: further prioritize producers over consumers That is because many latency-critical tasks are producers. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:38:54 +09:00
Changwoo Min	7106e8cdca	scx_lavd: support two-level scheduling for heavy-loaded cases We introduce two-level scheduling similar to scx_bpfland. The two-level scheduling consists of two DSQs: 1) latency-critical run queue and 2) regular run queue. The scheduler prioritizes scheduling tasks on the latency-critical queue but makes its best effort to schedule tasks on the regular queue. The scheduler could be more resilient under heavy load by segregating regular, non-latency-critical tasks from latency-critical tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:33:17 +09:00
Changwoo Min	9236c3e57c	scx_lavd: increase the targeted latency for heavy loaded cases Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 15:30:01 +09:00
Changwoo Min	230512208d	scx_lavd: fix div by zero error in some installations The max frequency information from topology (from sysfs) is not always reliable. In some installations, it returns zero for all CPUs. In this case, let's just consider all CPUs to have the same capacity (1024), hoping the kernel can give more precise information. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 12:47:00 +09:00
Changwoo Min	59e54f4972	scx_lavd: print how to disable logging Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-28 12:31:51 +09:00
Changwoo Min	df1108ec6c	scx_lavd: segregate starvation factor from the latency criticality (refactoring) Latency criticality is a task's inherent property, but the starvation factor is its dynamic status for the urgency of scheduling. Hence, we segregate the starvation factor out. Also, cleaned up unnecessary arguments and struct fields related. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-27 17:25:39 +09:00
Changwoo Min	d4a5a629ff	Merge pull request #452 from multics69/lavd-core-compaction-v2 scx_lavd: initial support for AMP (asymmetric multi-processor) architecture	2024-07-27 16:22:27 +09:00
Changwoo Min	eeea847697	scx_lavd: adjust time slice based on CPU's capacity When a task is running on a more performant core, the scheduler will give it a longer time slice. On the other hand, on a less performant core, a shorter time slice will be assigned. The longer time slice helps boost the clock frequency on a performant core. Also, the shorter time slice gives the performant core more chances to be utilized. Regarding the CPU capacity, we first check whether the kernel-provided capacity values are trustworthy. If not (i.e., all the same values), we rely on the user-provided value, based on each CPU's maximum clock frequency. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	e7b6ed1838	scx_lavd: add --prefer-smt-core option When the --prefer-smt-core option is on, the core compaction prefers to utilize the hyper-twin first before utilizing the other physical CPUs. By default, the option is off. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	19e337cd9b	scx_lavd: make the core compaction AMP-aware Previously, the core compaction assumed that each core's capacity was the same. Now, we additionally consider each core's max clock frequency. So, it always tries to use the higher-frequency cores first. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	dbb3957eb1	scx_lavd: add a missing no_freq_scaling option check Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	90b57a3fd7	scx_lavd: put a pinned kernel task to an overflow set Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Changwoo Min	e76bf999df	scx_lavd: clean up constants (no functional changes) Remove unused constants and rename outdated constants to proper names (LAVD_TC_* to LAVD_CC_* and LAVD_ELIGIBLE_DSQ to LAVD_GLOBAL_DSQ). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-26 18:46:21 +09:00
Andrea Righi	19854f1535	scx_bpfland: allow to specify negative values with --slice-us-lag Using negative values with --slice-us-lag can be useful to make performance more consistent and prioritize newly created tasks over the running tasks. Therefore, allow to specify negative values from the command line and also update the documentation of this option. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-26 09:10:18 +02:00
David Vernet	5401876430	Revert "rusty: Rework deadline as a signed sum"	2024-07-25 14:50:45 -05:00
David Vernet	09536aa15d	Merge pull request #309 from sched-ext/rusty_improved_dl rusty: Rework deadline as a signed sum	2024-07-25 13:44:54 -05:00
David Vernet	c1ad602ce5	rusty: Transfer latency priority between CPU-intensive and interactive tasks In some scenarios, a CPU-intensive task may be on the critical path for interactive workloads. For example, you may have a game with CPU-intensive tasks that are crunching the logic for the game, and that's required for the game to proceed without being choppy. To support such workflows, this change adds logic to allow a non-interactive task to inherit the lower (i.e. stronger) latency priority of another task if it wakes or is woken by that task. Signed-off-by: David Vernet <void@manifault.com>	2024-07-25 11:55:40 -05:00
David Vernet	933ea9baa1	rusty: Rework deadline as a signed sum Currently, a task's deadline is computed as its vtime + a scaled function of its average runtime (with its deadline being scaled down if it's more interactive). This makes sense intuitively, as we do want an interactive task to have an earlier deadline, but it also has some flaws. For one thing, we're currently ignoring duty cycle when determining a task's deadline. This has a few implications. Firstly, because we reward tasks with higher waker and blocked frequencies due to considering them to be part of a work chain, we implicitly penalize tasks that rarely ever use the CPU because those frequencies are low. While those tasks are likely not part of a work chain, they also should get an interactivity boost just by pure virtue of not using the CPU very often. This should in theory be addressed by vruntime, but because we cap the amount of vtime that a task can accumulate to one slice, it may not be adequately reflected after a task runs for the first time. Another problem is that we're minimizing a task's deadline if it's interactive, but we're also not really penalizing a task that's a super CPU hog by increasing its deadline. We sort of do a bit by applying a higher niceness which gives it a higher deadline for a lower weight, but it's somewhat minimal considering that we're using niceness, and that the best an interactive task can do is minimize its deadline to near zero relative to its vtime. What we really want to do is "negatively" scale an interactive task's deadline with the same magnitude as we "positively" scale a CPU-hogging task's deadline. To do this, we make two major changes to how we compute deadline: 1. Instead of using niceness, we now instead use our own straightforward scaling factor. This was chosen arbitrarily to be a scaling by 1000, but we can and should improve this in the future. 2. We now create a _signed_ linear latency priority factor as a sum of the three following inputs: - Work-chain factor (log_2 of product of blocked freq and waker freq) - Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle -- higher duty cycle means lower factor) - Average runtime factor (Higher avg runtime means higher average runtime factor) We then compute the latency priority as: lat_prio := Average runtime factor - (work-chain factor + duty cycle factor) This gives us a signed value that can be negative. With this, we can compute a non-negative weight value by calculating a weight from the absolute value of lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive, it's the task's vtime PLUS scaled slice_ns. This ends up working well because you get a higher weight both for highly interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets you scale a task's deadline "more negatively" for interactive tasks, and "more positively" for the CPU hogs. With this change, we get a significant improvement in FPS. On a 7950X, if I run the following workload: $ stress-ng -c $((8 * $(nproc))) 1. I get 60 FPS when playing Stellaris (while time is progressing at max speed), whereas EEVDF gets 6-7 FPS. 2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. The Civ6 benchmark doesn't even start after over 4 minutes in the initial frame with EEVDF, but gets us 13s / turn with rusty. 3. It seems that EEVDF has improved with Terraria in v6.9. It was able to maintain ~30-55 FPS, as opposed to the ~5-10FPS we've seen in the past. rusty is still able to maintain a solid 60-62FPS consistently with no problem, however.	2024-07-25 11:55:03 -05:00
Daniel Hodges	4c3fd6cd9b	scx_layered: Rename UserId and GroupId TLDR; rename UserId and GroupId to UIDEquals and GIDEquals. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 15:09:08 -07:00
Daniel Hodges	55f6d68eef	scx_layered: Add user and group layers Add a layer match based on either the effective user id or the effective group id. This allows for creating layers for individual users or groups. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 15:09:08 -07:00
Daniel Hodges	4042fc42d7	Merge pull request #446 from hodgesds/layered-topo scx_layered: Add topology awareness for NUMA nodes and LLCs	2024-07-24 18:06:43 -04:00
Daniel Hodges	2803f9c127	scx_layered: Fix formatting issues Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 14:39:02 -07:00
Daniel Hodges	0814abf0b8	scx_layered: Add node topology awareness Add NUMA node topology awareness for scx_layered. This borrows some of the NUMA handling from scx_rusty and allows layers to set a node mask. Different layer kinds will use the node mask differently. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-24 09:53:48 -07:00
Daniel Müller	98af514972	scx_rusty: Simplify LoadBalancer::populate_tasks_by_load() Simplify LoadBalancer::populate_tasks_by_load() by cutting out the heap allocation bits, by moving mutable accesses in front of immutable ones. Because multiple immutable accesses (between bss and rodata) do not conflict, we don't need the intermediate PID storage. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-23 13:59:26 -07:00
Andrea Righi	46ddca6bd5	scx_bpfland: report task time slice to stdout Periodically report to stdout samples of the effective time slice applied to tasks. While one could determine this metric by examining the max slice_ns and nr_waiting metrics, directly reporting it to stdout allows users to quickly identify what is happening and it provides a clearer overview of the scheduling behavior. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	c1d93d2a00	scx_bpfland: drop kthread dispatches metric Dispatching per-CPU kthreads directly is disabled by default, so reporting this metric can generate some confusion (since it is always 0). Even if local kthread dispatches are enabled, they should still be considered regular direct dispatches (there is no difference in practice). Therefore, merge direct kthread dispatches into direct dispatches and drop the separate nr_kthread_dispatches metric. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	a5f1d6b595	scx_bpfland: show average amount of tasks waiting to be dispatched Periodically report the average amount of tasks sitting in the priority and shared DSQs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:45 +02:00
Andrea Righi	5908a985bc	scx_bpfland: adjust task time slice based on the amount of waiting tasks Scale the task's time slice based on the average amount of tasks that are currently waiting to be dispatched. Use a moving average for the amount of waiting tasks to smooth out potential spikes caused by temporary bursts of tasks piling in the wait queues. This was initially modeled in scx_rustland and it seems to work pretty well also in scx_bpfland now. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 21:53:25 +02:00
Changwoo Min	af75d147c8	Merge pull request #443 from multics69/lavd-vtime scx_lavd: overhaul the virtual deadline algorithm	2024-07-21 18:00:57 +09:00
Changwoo Min	a9aab6b229	scx_lavd: fix typo Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-21 17:58:44 +09:00
Changwoo Min	add96f0e18	scx_lavd: do not maintain ineligible runnable tasks separately With all the other optimizations and tunings, it turns out that maintaining two runqueues has more harm than good. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 17:49:12 +09:00
Changwoo Min	827187d213	scx_lavd: adjust ineligible duration according to task's lat_cri Further depenalize above-average latency-critical tasks and further penalize below-average latency-critical tasks in ineligibility duration. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 17:37:27 +09:00
Changwoo Min	c653622ed9	scx_lavd: add LAVD_VDL_LOOSENESS_FT in calculating virtual deadline LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller value means the deadline is tighter. While it is unlikely to be tuned, let's keep it as a tunable for now. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 12:00:50 +09:00
Changwoo Min	e94070d5ca	scx_lavd: remove LAVD_BOOST_* These are no longer necessary after directly using latency criticality. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 11:53:20 +09:00
Changwoo Min	43f0fcb87c	scx_lavd: removed unused LAVD_LOAD_FACTOR_* These are no longer necessary after removing load factor calculation. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 11:51:12 +09:00
David Vernet	4f11e2abe2	layered: Don't dispatch to LO_FALLBACK_DSQ Non-kthreads with custom affinities in non-open layers are dispatched into a LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom affinities. When a host is fully utilized, these tasks can end up being starved due to LO_FALLBACK_DSQ being consumed only when there are no other layers to consume from. In internal workloads at Meta, we've observed that this can happen in practice. Longer term, we can probably address this by implementing layer weights and applying that to fallback DSQs to avoid starvation. For now, let's just dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue. Signed-off-by: David Vernet <void@manifault.com>	2024-07-19 19:14:18 -05:00
Changwoo Min	3924ebaa4d	scx_lavd: properly synchronize taskc->vdeadline_log_clk Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 01:41:29 +09:00
Changwoo Min	02ad43d116	scx_lavd: directly use p->scx.weight instead of load_ideal Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 00:25:11 +09:00
Changwoo Min	c955caefd8	scx_lavd: drop sys_load_factor In theory, sys_load_factor should not be necessary since we do not stretch the time space anymore. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-20 00:10:29 +09:00
Changwoo Min	67a6deb983	scx_lavd: use lat_cri instead of lat_prio universally Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 23:56:51 +09:00
Daniel Hodges	b98a9f56a8	scx_layered: Add separate module for metrics Refactor the main module for scx_layered to move metrics into a separate module. This change makes no functional difference; it only changes code structure. This will make it a little easier to navigate the logic in the main scheduler code. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-19 07:40:24 -07:00
Changwoo Min	6f10d6907c	scx_lavd: drop sched_prio_to_slice_weight[] table Use p->scx.weight instead. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 22:39:01 +09:00
Changwoo Min	034303f00f	scx_lavd: consider starvation factor in determining latency criticality Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 22:17:50 +09:00
Daniel Hodges	d974690b5d	Merge pull request #435 from vax-r/remove_skip_while scx_rusty: Remove skip_while in find_first_candidate	2024-07-19 08:38:58 -04:00
Changwoo Min	99e0d21c3c	scx_lavd: drop the runtime factor in calculating latency criticality That is okay since the runtime is considered in calculating a virtual deadline. A shorter runtime will result in a tighter deadline linearly. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 17:28:40 +09:00
Changwoo Min	b90599e967	scx_lavd: do not inherit parent's properties If inheriting the parent's properties, a newly forked task tends to be too prioritized. That is, many parent processes, such as `make`, are a bit more latency-critical than average. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-19 15:29:13 +09:00
Andrea Righi	c4eb3ce7b4	scx_bpfland: introduce dynamic nvcsw threshold Instead of using a static value to classify tasks based on their average amount of voluntary context switches, try to periodically evaluate an optimal threshold, based on a global average of voluntary context switches among all the running tasks. Tasks with an average amount of voluntary context switches greater than the global average will be classified as interactive. The global average is evaluated as an exponentially weighted moving average (EWMA), as: avg(t) = avg(t - 1) * 0.75 + task_avg(t) * 0.25 This approach is more efficient than iterating through all tasks and it helps to prevent rapid fluctuations that may be caused by bursts of voluntary context switch events. The dynamic nvcsw threshold enables a more precise adjustment of the classification criteria to swiftly respond to global system changes: tasks can be quickly classified as interactive, but if the system experiences too many interactive events, the criteria for maintaining interactive status become stricter. This creates a natural selection process where only the most deserving tasks remain interactive. Additionally, introduce the new option `--nvcsw-max-thresh N`, which allows to extend or restrict the fluctuation range of the global average threshold for voluntary context switches. Tested-by: Piotr Gorski <piotrgorski@cachyos.org> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-18 19:03:25 +02:00
Changwoo Min	78d96a6fb6	scx_lavd: advance clock in inverse proportion to the system load Advancing the clock slower when overloaded gives more opportunities for latency-critical tasks to cut in the run queue. Controlling the clock better reflects the actual load than the prior approach of stretching the time-space when overloaded. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-18 15:53:38 +09:00
Changwoo Min	9bc20f9160	scx_lavd: maintain ineligible runnable tasks separately We now maintain two run queues—an eligible run queue (DSQ) and an ineligible run queue (rbtree)—sorted by the task's virtual deadline. When the eligible run queue is empty, or the ineligible run queue has not been consumed for too long (e.g., 15 msec), a task in the ineligible run queue is moved to the eligible run queue for execution. With these two queues, we have a better admission control. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 23:46:11 +09:00
I Hsin Cheng	2525b94af4	scx_rusty: Remove unused variable Remove unused variable "has_preferred_dom". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-17 20:30:17 +08:00
I Hsin Cheng	bf2f0fbf35	scx_rusty: Remove skip_while in find_first_candidate Followed commit `1c3b563`, move the checking of task.migrated.get() into the vector filter. In this way, we can remove the skip_while() call in find_first_candidate(). Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-17 20:27:12 +08:00
Changwoo Min	55e19ea5df	scx_lavd: do not prioritize a wake-up task in ops.select_cpu() This is a prep for adding an ineligible DSQ. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 11:16:02 +09:00
Changwoo Min	c84b73e971	scx_lavd: rename LAVD_GLOBAL_DSQ to LAVD_ELIGIBLE_DSQ This is a prep to add a global ineligible dsq. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-17 10:34:34 +09:00
Daniel Müller	565aec3662	rust: Update libbpf-rs & libbpf-cargo to 0.24 Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated skeletons now contain directly accessible map and program objects, no longer necessitating the use of accessor methods. As a result, the risk for mutability conflicts is reduced greatly. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-16 11:48:52 -07:00
Daniel Hodges	27122a8a00	scx_rusty: refactor mempolicy handling bpf code and load balancing This change refactors some of the helper methods for getting the preferred node for tasks using mempolicy. The load balancing logic in try_find_move_task is updated to allow for a filter, which is used to filter for tasks with a preferred mempolicy. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 09:40:00 -07:00
Daniel Hodges	43a263aa75	scx_rusty: Use preferred node mask with balancer Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 08:11:19 -07:00
Daniel Hodges	bab6e9523c	scx_rusty: Add mempolicy checks to rusty This change makes scx_rusty mempolicy aware. When a process uses set_mempolicy it can change NUMA memory preferences and cause performance issues when tasks are scheduled on remote NUMA nodes. This change modifies task_pick_domain to use the new helper method that returns the preferred node id. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-07-16 08:11:19 -07:00
Changwoo Min	971bb2e024	scx_lavd: pretty formatting for ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:54:15 +09:00
Changwoo Min	adfbf3934c	scx_lavd: tuning the max ineligible duration Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:52:23 +09:00
Changwoo Min	eff444516f	scx_lavd: directly measure service time for eligibility enforcement Estimating the service time from run time and frequency is not incorrect. However, it reacts slowly to sudden changes since it relies on the moving average. Hence, we directly measure the service time to enforce fairness. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-16 23:48:26 +09:00
I Hsin Cheng	1c3b563caf	scx_rusty: Pre-check task domain mask with pull domain mask Instead of performing domain mask checking inside "find_first_candidate()" every time, check whether the tasks within the push domain are able to run on the pull domain by performing the mask check at the vector generation stage. This also avoids repeated computation for the same (task, pull_dom) pair, as they'll try to check whether the pull domain is in the task domain mask. Also, since whether a task is a kworker won't change over time, we can perform the check earlier and put it in the filter, too. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-16 21:48:06 +08:00
Tejun Heo	51334b5c4d	Bump versions for 1.0.1 release	2024-07-15 13:21:52 -10:00
Andrea Righi	8e7a526356	scx_bpfland: use nr_cpu_ids for consistency We always use nr_cpu_ids to represent the maximum CPU id returned by scx_bpf_nr_cpu_ids(). Replace cpu_max with nr_cpu_ids to be more consistent with the rest of the code. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 08:44:35 +02:00
Andrea Righi	33d06f653b	scx_bpfland: get rid of the MAX_CPUS hard-coded limit We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU DSQs, eliminating the need for the hard-coded limit MAX_CPUS. In this way scx_bpfland can support the same amount of CPUs that the kernel can handle. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:30 +02:00
Andrea Righi	b80ef7d8eb	scx_bpfland: optimize offline CPU handling Instead of constantly checking the need to drain tasks from the DSQs of the offline CPUs, provide an atomic flag to notify when there are tasks to be drained from the offline CPUs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:23 +02:00
Andrea Righi	0530706710	scx_bpfland: properly initialize the nvcsw metrics Initialize the number of voluntary context switches metrics in the local task storage. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:16:10 +02:00
Andrea Righi	bf4ad23599	scx_bpfland: refine interactive tasks flood safeguard Refine the safeguard mechanism to avoid generating too many interactive tasks in the system, which could nullify the effect of the interactive/regular task classification. The safeguard mechanism operates by pausing the promotion of new tasks to interactive status during the task wake-up process, whenever the number of interactive tasks in the priority queue exceeds a specific limit (set to 4x the number of online CPUs). Halting the promotion of additional interactive tasks allows to prioritize those already classified as interactive, thereby preventing potential "bursts" of excessive interactive tasks in the system. This refines the mitigation already provided by commit `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost"). Fixes: `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost") Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:11:34 +02:00
Andrea Righi	eb1cf0e670	scx_bpfland: improve task time slice evaluation Always assign the maximum time slice if there are idle CPUs in the system. Otherwise, double the task's unused time slice to reward tasks that use less CPU time and at the same time refill the time slice of the tasks every time they're dispatched. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-14 23:24:24 +02:00
Tejun Heo	3ae76acd12	Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0 Sync to upstream kernel and bump to 1.0	2024-07-14 07:00:38 -10:00
Changwoo Min	5b2112dd81	Merge pull request #421 from multics69/lavd-metrics scx_lavd: improve time slice and waker freq calculation	2024-07-14 18:49:36 +09:00
Tejun Heo	761ec142ce	Bump most versions to 1.0.0 sched_ext is about to be merged upstream. There are some compatibility breaking changes and we're making the current sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") the baseline. Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early development and is currently temporarily disabled, only the patchlevel is bumped.	2024-07-12 11:34:14 -10:00
Tejun Heo	54c487731a	Update to vmlinux-v6.10-rc2-g1edab907b57d.h Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely will be the commit which will be merged during the upcoming kernel v6.11 merge window. Unfortunately, this is a compatibility breaking change. As the size of bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and scx_layered - won't be able to run on older kernels. Likewise, older binaries from before this commit won't be able to run on newer kernels.	2024-07-12 11:13:34 -10:00
Tejun Heo	f261d0f037	Sync from kernel - 1edab907b57d Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11 - cgroup support hasn't landed in the upstream kernel yet. This most likely will happen in a few weeks. For the time being, disable scx_flatcg, scx_pair and scx_mitosis. - Compat macro for DSQ task iterator dropped. This is now a part of the baseline. - scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being discussed. Dropped example usage from tools/sched_ext. None of the practical schedulers use it, so this should be fine for now. - scx_bpf_cpu_rq() added. - AUTOATTACH workaround for newer libbpf versions added.	2024-07-12 11:08:41 -10:00
Changwoo Min	512bd143a5	scx_lavd: count only related tasks in calculating waker_freq A task can become runnable in any task's context, not only its waker's. Thus, we should not count wake-ups in an unrelated task's context. With this commit, the scheduler can (much more) accurately detect waker-wakee relationships. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:51:09 +09:00
Changwoo Min	95733f63ab	scx_lavd: calculate time slice as a function of run queue length The prior approach using the sum of weights gives too much penalty to nice tasks with large nice values. With this commit, the time slice is determined by the number of runnable tasks regardless of nice priority. Note that the fairness will still be enforced based on tasks' nice priorities (weights). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 22:45:22 +09:00
Changwoo Min	00fdc1d949	Merge pull request #417 from multics69/lavd-vdeadline scx_lavd: improve virtual deadline and current clock handling	2024-07-12 14:05:44 +09:00
Changwoo Min	d4bc92bea7	scx_lavd: print lat_cri to output Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 13:23:56 +09:00
Changwoo Min	4c5c564523	scx_lavd: initial current logical clock to zero To easily distinguish, let's initialize the current logical clock to zero (not the current physical time). Also, avoid the deadline calculation being zero by adding +1 here and there. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-12 10:15:54 +09:00
Andrea Righi	640bd562ff	scx_bpfland: prevent tasks from abusing interactive priority boost The priority boost for interactive tasks can be exploited to render the system nearly unresponsive by creating numerous tasks that constantly switch between wait/wakeup states. For example, stress tests like `hackbench -l 10000` can significantly degrade system responsiveness. To mitigate this, limit the number of interactive tasks added to the priority queue to 4x the number of online CPUs. This simple approach appears to be quite effective at identifying potential spam of "fake" interactive tasks, while still prioritizing legitimate interactive tasks. Additionally, periodically refresh the interactive status of the tasks based on their most recent average of voluntary context switches, preventing the interactive status from being too "sticky". Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:55 +02:00
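The cap described in 640bd562ff can be illustrated with a small C sketch. This only models the "4x the number of online CPUs" rule from the message; the function name `can_queue_interactive` and its parameters are hypothetical and are not taken from scx_bpfland.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether one more interactive task may be
 * added to the priority queue. The limit is 4x the number of online CPUs,
 * as stated in the commit message.
 */
static bool can_queue_interactive(uint64_t nr_queued, uint64_t nr_online_cpus)
{
	return nr_queued < 4 * nr_online_cpus;
}
```

With 8 online CPUs the limit is 32, so a 33rd task would presumably fall back to the regular queue instead of receiving the boost.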
Andrea Righi	1babb2b92d	scx_bpfland: prevent per-CPU kthreads starving other tasks Avoid dispatching per-CPU kthreads directly, since this may cause interactivity problems or unfairness, for example if there are too many softirqs being scheduled (e.g., in presence of high RX network traffic or when running certain stress tests, like hackbench). Moreover, in order to help with testing and benchmarks, introduce the option --local-kthread, that allows to restore the old behavior if enabled. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:48 +02:00
Andrea Righi	c3ebdd338f	scx_bpfland: prevent slice delta overflow When updating the task vruntime, ensure the time slice delta is always a positive value. Failing to do so may cause the global vruntime to increase excessively due to overflows. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	f59aa52fe7	scx_bpfland: expose the amount of online CPUs Periodically report the amount of online CPUs to stdout. The online CPUs are initially evaluated looking at the online cpumask, then the value is updated in the .cpu_offline() / .cpu_online() callbacks. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	3a47b484af	scx_bpfland: report interactive tasks to stdout Keep track of the CPUs that are running interactive tasks and report their amount to stdout. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	1a1a16b9e9	scx_bpfland: fix typo in slice_ns definition The correct default value of slice_ns is 5ms, not 5s. This change doesn't really make any difference in practice, since these values are changed by the Rust part when the scheduler is started, but it's good to keep this aligned to the proper values for consistency. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Changwoo Min	bdbfeb9fd1	scx_lavd: use logical current clock for virtual deadlines This commit changes the use of a physical clock to a virtual, logical clock in calculating deadlines. - The virtual current clock advances upon a task's running to its virtual deadline. - When enqueuing a task, its virtual deadline from the virtual current clock is calculated. With the above two changes, this guarantees that there is no such task whose virtual deadline is smaller than the virtual current clock. This means any enqueuing task can compete with any other already enqueued tasks. This allows a latency-critical task to be immediately scheduled if needed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 22:41:56 +09:00
Changwoo Min	408ea7892c	scx_lavd: induce sched_prio_to_latency_weight from slice weight So sched_prio_to_latency_weight is removed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:37:21 +09:00
Changwoo Min	bd964acff6	scx_lavd: deprioritize a newly forked task in latency Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:36:32 +09:00
Changwoo Min	48debe416e	scx_lavd: tuning the deadline equation under high load Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:54 +09:00
Changwoo Min	c72e063680	scx_lavd: do not use lat_prio_to_greedy_thresholds With other optimizations, lat_prio_to_greedy_thresholds is not effective any more. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:35:01 +09:00
Changwoo Min	9ed488798e	scx_lavd: use task's runtime to determine its deadline This has the effect of further preferring shorter jobs. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:34:25 +09:00
Changwoo Min	e081b2a294	scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-07-11 21:33:56 +09:00
Andrea Righi	995577762a	scx_bpfland: refill task time slice Every time we need to dispatch a task, re-evaluate its time slice as: (unused_time_slice + min_time_slice) / 2 This refills the time slice for tasks that haven't used much of their previously assigned time, improving fairness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
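The refill formula quoted in 995577762a can be written out as a short C sketch. It shows only the arithmetic from the message; the function name `refill_slice_ns` and its parameter names are hypothetical, and the real BPF code may differ.

```c
#include <stdint.h>

/*
 * Hypothetical helper: new slice = (unused_time_slice + min_time_slice) / 2.
 * Both inputs are in nanoseconds and are assumed to be small enough that
 * their sum cannot overflow a 64-bit integer.
 */
static uint64_t refill_slice_ns(uint64_t unused_slice_ns, uint64_t min_slice_ns)
{
	return (unused_slice_ns + min_slice_ns) / 2;
}
```

For example, a task that left 4 ms unused with a 1 ms minimum would get 2.5 ms, while a task that used its whole slice would get half the minimum.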
Andrea Righi	6a64182ef2	scx_bpfland: always classify interactive tasks Make sure to always classify interactive tasks, even when the system is not fully utilized. This ensures that if the system suddenly becomes overloaded, we already know which tasks need to be dispatched to the priority DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	8dd528abfd	scx_bpfland: pass enqueue flags when dispatching kthreads Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:10 +02:00
Andrea Righi	fc0d1bd003	Merge pull request #415 from sched-ext/bpfland-output scx_bpfland: additional stats and output improvements	2024-07-05 19:50:07 +02:00
Tejun Heo	af5e89e73c	Merge pull request #412 from vax-r/flatcg_delta_fetch scx_flatcg: Make good use of __sync_fetch_and_sub()	2024-07-05 07:39:22 -10:00
Tejun Heo	14d0a0ef64	Merge pull request #411 from vax-r/Fix_typo scx_flatcg: Fix_typo	2024-07-05 07:35:31 -10:00
Andrea Righi	2bc8f800e7	scx_bpfland: report build id version Use the version string provided by scx_utils:build_id. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	bdb31e98e2	scx_bpfland: show statistics in a more human-readable format Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	f9d7844b2e	scx_bpfland: split direct dispatches and kthread dispatches Show separate statistics for direct dispatches and kthread direct dispatches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:27:59 +02:00
I Hsin Cheng	aae826b1b3	scx_flatcg: Make good use of __sync_fetch_and_sub() Fetch the value of "delta" directly from the returned value from __sync_fetch_and_sub, as it returns the origin value of cgc->cvtime_delta. Additional fetching instruction of cgc->cvtime_delta would be redundant here. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-05 01:03:20 +08:00
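The property that aae826b1b3 relies on is that the GCC/Clang builtin `__sync_fetch_and_sub()` returns the value held before the subtraction. A minimal user-space sketch, assuming a stand-in structure (`struct cgrp_ctx` and `take_delta` are hypothetical; only the field name `cvtime_delta` comes from the message):

```c
#include <stdint.h>

/* Hypothetical stand-in that models only the field named in the commit. */
struct cgrp_ctx {
	int64_t cvtime_delta;
};

/*
 * Atomically subtract `amount` from cvtime_delta and return the value it
 * held before the subtraction, so no separate load is needed.
 */
static int64_t take_delta(struct cgrp_ctx *cgc, int64_t amount)
{
	return __sync_fetch_and_sub(&cgc->cvtime_delta, amount);
}
```

Because the old value comes back from the atomic operation itself, reading the field again afterwards would be redundant.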
I Hsin Cheng	3e52761487	scx_flatcg: Fix_typo Fix "oppotunistic" to "opportunistic". Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-07-04 22:04:40 +08:00
Andrea Righi	cfe2ed063d	scx_bpfland: time-based starvation prevention Tasks are consumed from various DSQs in the following order: per-CPU DSQs => priority DSQ => shared DSQ Tasks in the shared DSQ may be starved by those in the priority DSQ, which in turn may be starved by tasks dispatched to any per-CPU DSQ. To mitigate this, record the timestamp of the last task scheduling event both from the priority DSQ and the shared DSQ. If the starvation threshold is exceeded without consuming a task, the scheduler will be forced to consume a task from the corresponding DSQ. The starvation threshold can be adjusted using the --starvation-thresh command line parameter (default is 5ms). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:52:39 +02:00
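The consumption order and the time-based override described in cfe2ed063d can be modelled in plain C. This is a simplified sketch under stated assumptions, not the scheduler's BPF code: all type and function names are hypothetical, and checking the priority DSQ for starvation before the shared DSQ is an assumption the message does not confirm.

```c
#include <stdbool.h>
#include <stdint.h>

enum dsq { DSQ_NONE, DSQ_PER_CPU, DSQ_PRIORITY, DSQ_SHARED };

/* Hypothetical snapshot of queue state seen by one CPU. */
struct dsq_state {
	bool cpu_has_task, prio_has_task, shared_has_task;
	uint64_t prio_last_ns, shared_last_ns; /* last time each DSQ was consumed */
};

/* A non-empty queue is starving if it was not consumed for over thresh_ns. */
static bool starving(bool has_task, uint64_t last_ns, uint64_t now_ns,
		     uint64_t thresh_ns)
{
	return has_task && now_ns - last_ns > thresh_ns;
}

static enum dsq pick_dsq(struct dsq_state *s, uint64_t now_ns, uint64_t thresh_ns)
{
	enum dsq pick = DSQ_NONE;

	if (starving(s->prio_has_task, s->prio_last_ns, now_ns, thresh_ns))
		pick = DSQ_PRIORITY;	/* forced: priority DSQ starved */
	else if (starving(s->shared_has_task, s->shared_last_ns, now_ns, thresh_ns))
		pick = DSQ_SHARED;	/* forced: shared DSQ starved */
	else if (s->cpu_has_task)
		pick = DSQ_PER_CPU;	/* normal order: per-CPU first */
	else if (s->prio_has_task)
		pick = DSQ_PRIORITY;
	else if (s->shared_has_task)
		pick = DSQ_SHARED;

	/* Record the timestamp of the scheduling event for the chosen DSQ. */
	if (pick == DSQ_PRIORITY)
		s->prio_last_ns = now_ns;
	else if (pick == DSQ_SHARED)
		s->shared_last_ns = now_ns;
	return pick;
}
```

With the default 5 ms threshold, a busy per-CPU DSQ wins until one of the other queues has waited longer than 5 ms, at which point that queue is served once and its timestamp is reset.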
Andrea Righi	9e0db4ae17	scx_bpfland: remove unnecessary RCU read protection There is no need to RCU protect the cpumask for the offline CPUs: it is created once when the scheduler is initialized and it's never deallocated. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	cef6ca93cf	scx_bpfland: adjust default time slice to 5ms Reduce the default time slice down to 5ms for a faster reaction and system responsiveness when the system is overcommissioned. This also helps to provide a more predictable level of performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	7d15e3171c	scx_bpfland: ensure task time slice never exceeds the slice_ns limit Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	e8a4d350ad	scx_bpfland: unify dispatching kthreads with direct CPU dispatches Always use direct CPU dispatch for kthreads, there is no need to treat kthreads in a special way, simply reuse direct CPU dispatch to prioritize them. Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(), since we may dispatch multiple tasks to the same per-CPU DSQ now. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 09:38:43 +02:00
Andrea Righi	d2231b0aed	scx_bpfland: drop built-in idle CPU selection logic Small refactoring of the idle CPU selection logic: - optimize idle CPU selection for tasks that can run on a single CPU - drop the built-in idle selection policy and completely rely on the custom one Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 08:54:17 +02:00
Andrea Righi	7c355f50b2	scx_bpfland: use the right cpumask to find any idle CPU We are incorrectly using the SMT idle cpumask to find any idle CPU, fix by using the generic idle cpumask. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-01 20:36:24 +02:00
Andrea Righi	c458f345b4	Merge pull request #408 from sched-ext/bpfland-cpu-hotplug scx_bpfland: support CPU hotplugging	2024-07-01 19:41:00 +02:00
Dan Schatzberg	32ac4b2cff	Merge pull request #389 from dschatzberg/mitosis mitosis: Update synchronization	2024-07-01 09:44:26 -04:00
Andrea Righi	ff7a518d28	scx_bpfland: support CPU hotplugging Implement CPU hotplugging in scx_bpfland without restarting the scheduler. The idle selection logic has been updated to consider online CPUs. Additionally, a cpumask for offline CPUs has been introduced. Tasks that have been dispatched to the DSQs associated with offline CPUs are consumed by the other CPUs that are still online. Moreover, the dependency on the Topology crate is temporarily dropped and instead, /sys/devices/system/cpu/smt/active is used to determine if SMT should be taken into account during idle selection. The Topology crate will be re-introduced later when scx_bpfland will gain more topology-aware capabilities. This fixes #406. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 23:04:13 +02:00
Andrea Righi	d76551bbd3	scx_rusty: fix stats map initialization The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can result in errors such as: Error: Failed to zero stat Caused by: number of values 6 != number of cpus 8 Fix by using num_possible_cpus() to initialize it. Fixes: `263e02f6` ("rusty: Use nr_cpu_ids instead of nr_cpus_possible") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 17:37:14 +02:00
Andrea Righi	74175f5a49	scx_bpfland: properly integrate with meson build Properly honor the meson build `serialize` option. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 21:37:00 +02:00
Andrea Righi	f98c35fd07	Merge pull request #388 from sched-ext/bpfland scheds: introduce scx_bpfland	2024-06-28 21:27:43 +02:00
Andrea Righi	cf4883fbf8	meson: introduce serialize build option With commit `5d20f89a` ("scheds-rust: build rust schedulers in sequence"), schedulers are now built serially one after the other to prevent meson and cargo from forking NxN parallel tasks. However, this change has made building a single scheduler much more cumbersome, due to the chain of dependencies. For example, building scx_rusty using the specific meson target would still result in all schedulers being built, because they all depend on each other. To address this issue, introduce the new meson build option `serialize=true|false` (default is false). This option allows disabling the schedulers' build chain, restoring the old behavior. With `serialize=false`, it is possible to build just a single scheduler, parallelizing the cargo build properly, without triggering the build of the others. Example: $ meson setup build -Dbuildtype=release -Dserialize=false $ meson compile -C build scx_rusty Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 10:17:37 +02:00
Changwoo Min	24a238846e	scx_lavd: optimizing deadline related tunables The competition window was 7.5 msec, half of the targeted latency. However, it is too wide for some workloads, so unrelated tasks may compete with each other. Hence, it is tightened to about 1 msec with LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition. Also, when a system is overloaded, now the time space is stretched more aggressively (i.e., lat_prio^2) when a task's latency priority is low (high value). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-28 09:00:45 +09:00
Andrea Righi	7606b95150	scx_bpfland: introduce maximum time slice lag Introduce a tunable to set a limit of the minimum vruntime that is used when a task is dispatched, as: vtime_min = vtime_now - slice_lag_ns Increasing the time slice lag can make interactive tasks even more responsive at the cost of starving regular and newly created tasks. Default time slice lag is 0. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
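The bound introduced in 7606b95150 can be sketched in C as a clamp on a task's vruntime. This is an illustration of `vtime_min = vtime_now - slice_lag_ns` only; the function names are hypothetical, and the wraparound-safe comparison is an assumption about how 64-bit virtual times would typically be compared.

```c
#include <stdint.h>

/* Wraparound-safe test: is virtual time a earlier than b? */
static int vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Hypothetical helper: never let a task be dispatched with a vruntime
 * older than vtime_now - slice_lag_ns.
 */
static uint64_t clamp_task_vtime(uint64_t task_vtime, uint64_t vtime_now,
				 uint64_t slice_lag_ns)
{
	uint64_t vtime_min = vtime_now - slice_lag_ns;

	if (vtime_before(task_vtime, vtime_min))
		return vtime_min;
	return task_vtime;
}
```

With the default lag of 0, a task that slept for a long time is pulled up to the current vruntime. A larger lag lets it keep part of its accumulated credit, which is what makes interactive tasks more responsive at the expense of others.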
Andrea Righi	5a44329d45	scheds: introduce scx_bpfland Overview ======== This scheduler is derived from scx_rustland, but it is fully implemented in BPF, with a minimal user-space Rust part to process command line options, collect metrics and log scheduling statistics. Unlike scx_rustland, all scheduling decisions are made by the BPF component. Motivation ========== The primary goal of this scheduler is to act as a performance baseline for comparison with scx_rustland, allowing for a better assessment of the overhead caused by kernel/user-space interactions. It can also be used to deploy prototypes initially tested in the scx_rustland scheduler. In fact, this scheduler is expected to outperform scx_rustland, due to the elimination of the kernel/user-space overhead. Scheduling policy ================= scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes interactive workloads. Its scheduling policy closely mirrors scx_rustland, but it has been re-implemented in BPF with some small adjustments. Tasks are categorized as either interactive or regular based on their average rate of voluntary context switches per second: tasks that exceed a specific voluntary context switch threshold are classified as interactive. Interactive tasks are prioritized in a higher-priority DSQ, while regular tasks are placed in a lower-priority DSQ. Within each queue, tasks are sorted based on their weighted runtime, using the built-in scx vtime ordering capabilities (scx_bpf_dispatch_vtime()). Moreover, each task gets a time slice budget. When a task is dispatched, it receives a time slice equivalent to the remaining unused portion of its previously allocated time slice (with a minimum threshold applied). This gives latency-sensitive workloads more chances to exceed their time slice when needed to perform short bursts of CPU activity without being interrupted (e.g., real-time audio encoding / decoding workloads). 
Results ======= Initial test results indicate that this scheduler offers around a +5% improvement in frames-per-second (fps) compared to scx_rustland when using the benchmark "playing a video game while recompiling the kernel". This improvement was observed in games such as Cyberpunk 2077, Counter-Strike 2, and Baldur's Gate 3. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
Changwoo Min	f86d564d89	scx_lavd: fast path for ops.dispatch() when fully loaded When the system is fully loaded and all CPUs are in use, skip checking the cpumask. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-27 18:00:39 +09:00
David Vernet	fe3ce64a9b	Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load"	2024-06-26 17:35:22 -04:00
Changwoo Min	abc6e31fef	scx_lavd: for a forked task, inherit its parent's statistics The old approach was too conservative in running a new task, so when a fork-heavy workload competes with a CPU-bound workload, the fork-heavy one is starved. The new approach solves the starvation problem by inheriting parent's statistics. It seems a good (at least better than old) guess how a new task will behave. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 19:00:10 +09:00
Changwoo Min	ac9c49f5b5	scx_lavd: loosen the deadline when overloaded When the system is highly loaded with compute-intensive tasks, the old setting chokes latency-sensitive tasks, so loosen the deadline when the system is overloaded (> 100% utilization). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 15:06:31 +09:00
Changwoo Min	b32734168b	scx_lavd: print build ID when lavd is loaded When the lavd is loaded, it prints out its build id. It helps to easily identify what version it is when testing. ``` 01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu) ``` Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-26 10:57:19 +09:00
Dan Schatzberg	d349f86d04	mitosis: Update synchronization The synchronization for mitosis is a bit ad-hoc, working around lack of atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and compiler barriers to get the behaviors we want. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-06-25 12:44:16 -07:00
David Vernet	d42bae4fcf	rusty: Print build ID when rusty is loaded When someone is testing schedulers, we often have to ask what version the scheduler is running as. Now that we can access the build ID from rust schedulers, let's update scx_rusty to print the build ID when rusty first starts running. This results in output such as the following: ``` [void@maniforge scx]$ rusty 19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug) 19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111 19:04:26 [INFO] DOM[00] mask= 0b00000000111111110000000011111111 19:04:26 [INFO] DOM[01] mask= 0b11111111000000001111111100000000 19:04:26 [INFO] Rusty scheduler started! ``` Signed-off-by: David Vernet <void@manifault.com>	2024-06-25 11:44:46 -05:00
David Vernet	9d9ece11aa	Merge pull request #384 from jfernandez/log-recorder scx_utils: Add log_recorder module for metrics-rs	2024-06-25 11:43:37 -05:00
Changwoo Min	5d0db5c5fe	scx_lavd: revising tunables to reduce micro-stutters This is a second attempt to optimize tunables for a wider range of games. 1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range). Now the latency priority (biased by nice value) will decide which task should run first. The nice value will decide the time slice. 2) The first change gives higher priority to latency-critical tasks than before. To compensate, the slice boost is also increased (2x -> 3x). Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-25 16:13:32 +09:00
Jose Fernandez	e5984ed016	scx_utils: Add log_recorder module for metrics-rs This change adds a new module to the scx_utils crate that provides a log recorder for metrics-rs. The log recorder will log all metrics to the console at a configurable interval in an easy to read format. Each metric type will be displayed in a separate section. Indentation will be used to show the hierarchy of the metrics. This results in a more verbose output, but it is easier to read and understand. scx_rusty was updated to use the log recorder and all explicit metric logging was removed. Counters will show the total count and the rate of change per second. Counters with an additional label, like `type` in `dispatched_tasks_total` in rusty, will show the count, rate, and percentage of the total count. Counters: dispatched_tasks_total: 65559 [1344.8/s] prev_idle: 44963 (68.6%) [966.5/s] wsync_prev_idle: 15696 (23.9%) [317.3/s] direct_dispatch: 2833 (4.3%) [35.3/s] dsq: 1804 (2.8%) [21.3/s] wsync: 262 (0.4%) [4.3/s] direct_greedy: 1 (0.0%) [0.0/s] pinned: 0 (0.0%) [0.0/s] greedy_idle: 0 (0.0%) [0.0/s] greedy_xnuma: 0 (0.0%) [0.0/s] direct_greedy_far: 0 (0.0%) [0.0/s] greedy_local: 0 (0.0%) [0.0/s] dl_clamped_total: 1290 [20.3/s] dl_preset_total: 514 [1.0/s] kick_greedy_total: 6 [0.3/s] lb_data_errors_total: 0 [0.0/s] load_balance_total: 0 [0.0/s] repatriate_total: 0 [0.0/s] task_errors_total: 0 [0.0/s] Gauges will show the last set value: Gauges: slice_length_us: 20000.00 Histograms will show the average, min, and max. The histogram will be reset after each log interval to avoid memory leaks, since the data structure that holds the samples is unbounded. Histograms: cpu_busy_pct: avg=1.66 min=1.16 max=2.16 load_avg node=0: avg=0.31 min=0.23 max=0.39 load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39 processing_duration_us: avg=297.50 min=296.00 max=299.00 Signed-off-by: Jose Fernandez <josef@netflix.com>	2024-06-24 18:45:02 -06:00
David Vernet	8059acb634	Merge pull request #381 from vax-r/rusty_dom_load_status_check scx_rusty: Pull domain status check	2024-06-24 17:54:54 -05:00
David Vernet	55ee210d42	Merge pull request #382 from vax-r/rusty_refactor scx_rusty: Refactor ridx assignment in populate_tasks_by_load	2024-06-24 17:47:55 -05:00
Changwoo Min	016229cbcf	scx_lavd: revising tunables for less-preemptive games In some games (e.g., Elden Ring), it was observed that preemption happens much less frequently. The reason is that tasks' runtime per schedule is similar, so it does not meet the existing criteria. To alleviate the problem, the following three tunables are revised: 1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to trigger more preemption. 2) Smaller LAVD_SLICE_MAX_NS works better, especially on 250 or 300Hz kernels. 3) Longer LAVD_ELIGIBLE_TIME_MAX perturbs timelines less frequently. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-06-24 00:27:33 +09:00
I Hsin Cheng	eab234a74f	scx_rusty: Refactor ridx assignment in populate_tasks_by_load The original assignment of the variable ridx is equivalent to a comparison between "ridx" and "wids - MAX_PIDS". Use the u64 max library helper function to perform the comparison and improve readability. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:58:51 +08:00
I Hsin Cheng	84b9ac4dce	scx_rusty: Pull domain status check Check whether the BalanceState of pull_dom.load inside function try_find_move_task is actually the variant NeedsPull. This makes task migration a bit more conservative when the system is under high load. Experiments were run while the system was compiling the Linux kernel and performing a large amount of I/O with fio at the same time. The results show 126,617 task migrations system-wide before the modification and 115,419 after it. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>	2024-06-23 21:38:23 +08:00
David Vernet	5038f54701	Merge pull request #377 from jfernandez/metrics-rs rusty: Integrate stats with the metrics framework	2024-06-21 15:23:20 -05:00
David Vernet	9919b71fd4	Merge pull request #379 from sched-ext/topo_nr_cpu_ids Add topo.nr_cpu_ids() to Topology crate	2024-06-21 13:35:05 -05:00
David Vernet	3bd15be840	rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible() In scx_rlfifo, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. To properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that we don't accidentally dispatch to an invalid DSQ. Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:20 -05:00
David Vernet	263e02f644	rusty: Use nr_cpu_ids instead of nr_cpus_possible In scx_rusty, we're currently using topo.nr_cpus_possible() to determine how many possible CPU IDs we could have on the system. scx_rusty already accounts for offlined CPUs, so to properly support systems whose disabled CPUs may be in the middle of the range of possible CPU IDs, let's instead use topo.nr_cpu_ids(). Signed-off-by: David Vernet <void@manifault.com>	2024-06-21 12:57:19 -05:00


759 Commits