Schedule all tasks using a single global DSQ. This gives better
control to prevent potential starvation conditions.
With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average number
of voluntary context switches.
Tasks are still classified as interactive / non-interactive based on
the number of voluntary context switches, but this classification now
only affects the cpufreq logic.
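A rough sketch of the idea (helper and field names are illustrative,
not the actual scx_bpfland code): the wait and wake-up frequencies are
tracked per task as moving averages and folded into the deadline, so
tasks that block and wake up often get an earlier deadline:

```c
/*
 * Illustrative sketch only: avg_wait_freq / avg_wake_freq are assumed
 * to be moving averages maintained elsewhere in the task context.
 */
static u64 task_deadline(struct task_struct *p, struct task_ctx *tctx)
{
	u64 freq = tctx->avg_wait_freq + tctx->avg_wake_freq;
	u64 lat_weight = freq ? freq : 1;

	/* Higher wait/wake frequency => smaller delta => earlier deadline */
	return p->scx.dsq_vtime + slice_max_ns * 100 / lat_weight;
}
```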
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Since tasks' average runtimes show a skewed distribution, directly
using the runtime in the deadline calculation causes several
performance regressions. Instead, let's use a constant factor and
further prioritize the frequency factors to deprioritize long-runtime
tasks.
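Roughly, the resulting deadline looks like the following sketch (the
constant and field names are illustrative, not the actual scx_lavd
code):

```c
/*
 * Illustrative sketch: replace the raw average runtime term with a
 * constant base cost and let the wait/wake frequency factors dominate.
 */
static u64 calc_deadline(struct task_ctx *taskc, u64 now)
{
	u64 freq_factor = taskc->wait_freq + taskc->wake_freq;

	if (!freq_factor)
		freq_factor = 1;

	/* DEADLINE_CONST replaces avg_runtime, whose distribution is skewed */
	return now + DEADLINE_CONST / freq_factor;
}
```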
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revert the change that sent a self-IPI at preemption when the victim
CPU is the current CPU. The cost of a self-IPI is prohibitively
expensive in some workloads (e.g., perf bench). Instead, reset the
task's time slice to zero.
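A minimal sketch of the new behavior (the helper is hypothetical; the
real scx_lavd code differs in details):

```c
/*
 * Sketch: when the victim is the current CPU, skip the costly self-IPI
 * and just expire the running task's slice; the sched_ext core will
 * pick a new task at the next scheduling point.
 */
static void preempt_cpu(s32 victim_cpu, struct task_struct *cur)
{
	if (victim_cpu == bpf_get_smp_processor_id()) {
		cur->scx.slice = 0;	/* zero slice => switch out ASAP */
		return;
	}

	/* A remote victim still needs an actual kick */
	scx_bpf_kick_cpu(victim_cpu, SCX_KICK_PREEMPT);
}
```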
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rather than always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow the migration but
attempt to bring tasks back to their original LLC domain whenever
possible.
To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.
In the future we will add a proper load balancing logic, but for now
this change seems to provide consistent performance improvement in
certain server workloads.
For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).
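A sketch of the idle-CPU selection with a per-task LLC domain (the LLC
cpumask parameter is a stand-in; the actual scx_bpfland code differs):

```c
/*
 * Sketch: prefer an idle CPU in the task's own LLC domain, assigned at
 * task creation or affinity change, and only then fall back to any
 * allowed CPU (a temporary cross-LLC migration).
 */
static s32 pick_idle_cpu(struct task_struct *p, const struct cpumask *llc_mask)
{
	s32 cpu;

	/* First choice: an idle CPU in the task's assigned LLC domain */
	cpu = scx_bpf_pick_idle_cpu(llc_mask, 0);
	if (cpu >= 0)
		return cpu;

	/* Otherwise allow a temporary cross-LLC migration */
	return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
}
```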
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This prevents excessive starvation of regular tasks in the presence of
a large number of interactive tasks (e.g., when running stress tests,
such as hackbench).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This can lead to stalls when a high number of interactive tasks are
running in the system (e.g., hackbench or similar stress tests).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.
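For reference, this is roughly what setting the flag looks like (a
sketch; the callback wiring is elided to keep it short):

```c
/*
 * Sketch: SCX_OPS_ENQ_EXITING tells the sched_ext core to also pass
 * exiting tasks through ops.enqueue() rather than bypassing the BPF
 * scheduler for them.
 */
SEC(".struct_ops.link")
struct sched_ext_ops bpfland_ops = {
	/* ... enqueue/dispatch/init/exit callbacks go here ... */
	.flags = SCX_OPS_ENQ_EXITING,
	.name  = "bpfland",
};
```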
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.
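The usual pattern, as seen in other scx schedulers (a minimal sketch):

```c
static u64 vtime_now;

/* True if vtime @a is before @b, handling wrap-around */
static inline bool vtime_before(u64 a, u64 b)
{
	return (s64)(a - b) < 0;
}

void BPF_STRUCT_OPS(bpfland_running, struct task_struct *p)
{
	/* Keep the global vruntime clock in sync with the running task */
	if (vtime_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}
```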
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Fix a task filtering logic error to avoid the possibility of migrating
the same task over again. The original logic used the "||" operator,
which might cause tasks that were already migrated to be considered
again. Change the condition to "&&" to eliminate the error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The function "try_find_move_task()" returns directly when no task is
found to be moved. If the cause is that no task can satisfy the
condition imposed by "task_filter()", the load balancer will try to
find a task to move again, replacing "task_filter()" with a function
that always returns true.
However, in that fallback case, the tasks within the domains will be
empty. Swapping the tasks back into the domains vector before
returning solves the issue.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The combination of kernel versions and kernel configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is currently no
workaround on the fentry/fexit/kprobe side. Let's entirely drop
the kernel lock tracking for now and revisit it later.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revise the lock tracking code to rely on symbols that are stable
across various kernel configurations. There are two changes:
- Entirely drop tracing of rt_mutex, which can be turned on and off by
  kconfig.
- Replace the mutex_lock() family with __mutex_lock(), which is stable
  across kernel configs. The downside of this change is that it is no
  longer possible to trace the lock fast path, so lock tracing is a bit
  less accurate. But let's live with it for now until a better solution
  is found.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fallback DSQs are not accounted with costs. If a layer is saturating
the machine, it is possible to never consume from the fallback DSQ and
stall its tasks. This introduces an additional consumption from the
fallback DSQ when a layer runs out of budget. In addition, tasks that
use partial CPU affinities should be placed into the fallback DSQ.
This change was tested with stress-ng --cacheline `nproc` for several
minutes without causing stalls (which would occur on main).
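A sketch of the consumption path (has_budget(), layer_dsq_id() and
FALLBACK_DSQ_ID are stand-ins, not the actual scx_layered names):

```c
/*
 * Sketch: even when a layer has exhausted its budget, keep draining
 * the fallback DSQ so tasks queued there cannot stall indefinitely
 * on a saturated machine.
 */
static bool try_consume(struct cost *costc, u32 layer_id)
{
	if (has_budget(costc, layer_id) &&
	    scx_bpf_consume(layer_dsq_id(layer_id)))
		return true;

	/* Out of budget: still consume from the fallback DSQ */
	return scx_bpf_consume(FALLBACK_DSQ_ID);
}
```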
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Flip the order of layer id vs layer name so that the output makes sense.
Example output:
LO_FALLBACK nr_queued=0 -0ms
COST GLOBAL[0][random] budget=22000000000 capacity=22000000000
COST GLOBAL[1][hodgesd] budget=0 capacity=0
COST GLOBAL[2][stress-ng] budget=0 capacity=0
COST GLOBAL[3][normal] budget=0 capacity=0
COST CPU[0][0][random] budget=62500000000000 capacity=62500000000000
COST CPU[0][1][random] budget=100000000000000 capacity=100000000000000
COST CPU[0][2][random] budget=124911500964411 capacity=125000000000000
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The dynamic nvcsw threshold is no longer used in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so
let's just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows tuning the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh), and it also restores the
maximum nvcsw limit in non-lowlatency mode, which was incorrectly
changed during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evaluate the number of voluntary context switches directly in the BPF
code, without relying on the kernel's p->nvcsw metric.
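A sketch of how this can be done in ops.stopping() (the task-context
lookup helper is hypothetical):

```c
/*
 * Sketch: in ops.stopping(), @runnable == false means the task is
 * releasing the CPU because it is blocking, i.e. a voluntary context
 * switch, so the counter can be maintained without p->nvcsw.
 */
void BPF_STRUCT_OPS(bpfland_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p); /* hypothetical helper */

	if (tctx && !runnable)
		tctx->nvcsw++;
}
```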
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add the layer CPU cost when dumping. This is useful for understanding
the per layer cost accounting when layered is stalled.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add the layer name to the BPF representation of a layer. When printing
debug output, print the layer name as well as the layer index.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
while it should be "struct task_ctx *".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor dispatch to use a separate set of global helpers for topo aware
dispatch. This change only refactors dispatch to make it more
maintainable, without any functional changes.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Pinning a task to a single CPU is a widely used optimization to
improve latency by reusing cache. So when a task is pinned to
a single CPU, let's boost its latency criticality.
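A minimal sketch of the boost (the constant and field names are
illustrative):

```c
/*
 * Sketch: a task pinned to a single CPU is likely optimized for cache
 * reuse, so raise its latency criticality.
 */
static void boost_pinned_task(struct task_struct *p, struct task_ctx *taskc)
{
	if (p->nr_cpus_allowed == 1)
		taskc->lat_cri += LAVD_LAT_CRI_PIN_BOOST; /* illustrative constant */
}
```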
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Calling reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to ops.running(). This way, we prevent the lock holder
preemption only when a lock is acquired between ops.running() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptibility test. This addresses the following
GitHub issue:
https://github.com/sched-ext/scx/issues/856
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add cost accounting for layers to make weights work on the BPF side.
This is done at both the CPU level as well as globally. When a CPU
runs out of budget, it acquires budget from the global context. If a
layer runs out of global budget, then all budgets are reset. Weight
handling is done by iterating over layers by their available budget.
Layer budgets are proportional to their weights.
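A sketch of the budget flow (lookup_global_cost(), reset_budgets() and
the cost layout are stand-ins, not the actual scx_layered helpers):

```c
/*
 * Sketch: charge the layer's per-CPU budget; on exhaustion, refill
 * from the global context, and when the global budget is also gone,
 * reset all budgets proportionally to the layer weights.
 */
static void charge_cost(struct cost *cpuc, u32 layer_id, s64 amount)
{
	cpuc->budget[layer_id] -= amount;
	if (cpuc->budget[layer_id] > 0)
		return;

	struct cost *global = lookup_global_cost();

	if (global && global->budget[layer_id] > 0) {
		/* Acquire a refill from the global context */
		s64 refill = global->capacity[layer_id] / nr_possible_cpus;

		global->budget[layer_id] -= refill;
		cpuc->budget[layer_id] += refill;
	} else {
		reset_budgets();	/* weighted reset across all layers */
	}
}
```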
When the current task decides to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this case because the sched_ext
core does not call resched_curr() on the ops.enqueue() path.
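A minimal sketch of the yield path under that constraint:

```c
/*
 * Sketch: expire the slice *and* kick the CPU with SCX_KICK_PREEMPT,
 * since a zero slice alone does not trigger resched_curr() from the
 * ops.enqueue() path.
 */
static void yield_current(struct task_struct *cur, s32 cpu)
{
	cur->scx.slice = 0;
	scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
```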
Signed-off-by: Changwoo Min <changwoo@igalia.com>
An eligible task is unlikely to be preemptible. In other words, an
ineligible task is more likely to be preemptible because of the greedy
ratio penalty in its virtual deadline calculation. Hence, we skip the
preemptibility test for an eligible task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Some of the new timer code doesn't verify on older kernels like 6.9. Modify the
code a little to get it verifying again.
Also apply some small fixes to the logic: error handling was a little
off before, and we were using the wrong key in lookups.
Test plan:
- CI
The previous code accesses uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <- try_yield_current_cpu()
to test whether task2 is a lock holder. However, task2 is guaranteed
not to be a lock holder in all its callers, so move the lock holder
test to can_cpu1_kick_cpu2().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This reduces the temporary stall time of the task by waking
up the CPU as early as possible.
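A sketch of the kick (domain cpumask handling is simplified):

```c
/*
 * Sketch: after enqueueing, wake an idle CPU in the chosen scheduling
 * domain. SCX_KICK_IDLE only kicks the CPU if it is actually idle, so
 * this is cheap when the domain is busy.
 */
static void kick_idle_cpu(const struct cpumask *domain_mask)
{
	s32 cpu = scx_bpf_pick_idle_cpu(domain_mask, 0);

	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}
```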
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used to penalize latency linearly in the greedy ratio. However,
this gives the greedy ratio too much influence in determining the
virtual deadline, especially among under-utilized tasks (< 100.0%).
Now, we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we apply a somewhat milder
penalty to avoid sudden latency spikes.
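Numerically, the new penalty curve looks roughly like this sketch (the
ratio scale and divisor are illustrative, not the actual scx_lavd
constants):

```c
/*
 * Sketch: greedy ratio in 1/1000 units (1000 == 100.0%). Everything
 * under-utilized collapses to the same value; over-utilization is
 * penalized sub-linearly to avoid sudden latency spikes.
 */
static u64 greedy_penalty(u64 ratio)
{
	if (ratio <= 1000)
		return 1000;			/* treat all < 100.0% alike */

	return 1000 + (ratio - 1000) / 2;	/* milder than linear */
}
```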
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, contextual information—such as sync wakeup and kernel
task—was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and waker and wakee frequencies by scaling up/down the run
time and the frequencies.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, preemption was allowed only when a task was early in its
time slice, enforced through LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is not necessary anymore because
the lock holder preemption logic can avoid harmful preemptions. So we
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash the preemption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When calculating a task's latency criticality, incorporate the task's
weight into the runtime, wake_freq, and wait_freq more systematically.
It looks nicer and works better under heavy load.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks in its local DSQ back to the global enqueue path.
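This maps naturally onto ops.cpu_release() with
scx_bpf_reenqueue_local(); a minimal sketch:

```c
/*
 * Sketch: when a higher-priority sched class (e.g. RT) takes this CPU,
 * re-enqueue everything sitting in the local DSQ so those tasks go
 * back through ops.enqueue() and can run on other CPUs.
 */
void BPF_STRUCT_OPS(lavd_cpu_release, s32 cpu,
		    struct scx_cpu_release_args *args)
{
	scx_bpf_reenqueue_local();
}
```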
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.
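A sketch of the one-shot refill (field and constant names are
illustrative):

```c
/*
 * Sketch: in ops.dispatch(), give a lock-holding task one slice
 * extension so it can release the lock before being switched out.
 */
static void try_extend_slice(struct task_struct *p, struct task_ctx *taskc)
{
	if (is_lock_holder(taskc) && !taskc->slice_extended) {
		taskc->slice_extended = true;	/* refill only once */
		p->scx.slice = slice_max_ns;
	}
}
```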
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Giving a larger penalty to long-running tasks helps to segregate
latency-critical tasks, which are usually short-running, from
long-running tasks, which are compute-intensive.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rework per-arch vmlinux solution
* have a per-arch directory under sched/include/arch/, in which we
  maintain the vmlinux.h symlink and the real file
  vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
  folder is removed.
* update the meson build `-I` option to find the new vmlinux.h location
* update cargo build scripts to use the per-arch vmlinux.h for
generating bindings
* keep the original ClangInfo refactoring changes
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average
number of voluntary context switches.
This ensures that latency-sensitive tasks naturally receive an
additional priority boost and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler more fair.
It also makes the scheduler more robust, now rustland can survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.
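A sketch of the budget clamp (written in C for illustration; names and
placement are not the actual scx_rustland code):

```c
static u64 vtime_now;	/* global vruntime clock */

/*
 * Sketch: the maximum vruntime budget an idle task can accumulate
 * scales with its latency weight, so latency-sensitive tasks wake up
 * with an earlier effective deadline without scaling vruntime down.
 */
static void refresh_task_vtime(struct task_struct *p, u64 lat_weight)
{
	u64 budget = slice_ns * lat_weight;

	/* Don't let a task fall more than @budget behind the clock */
	if ((s64)(p->scx.dsq_vtime - (vtime_now - budget)) < 0)
		p->scx.dsq_vtime = vtime_now - budget;
}
```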
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The algorithm has evolved to decide the time slice without
tracking the system-wide load, so remove the obsolete load tracking
code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
reset_lock_futex_boost() should be called at every context switch of a
task. Otherwise, in the worst case, a task (and its CPU) could block
preemption. To avoid such a situation, add the missing
reset_lock_futex_boost() calls.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Building the CpuPool from the cache-CPU topology did not work on arm,
because the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is
unavailable there. Read the CPU topology instead.
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rework lowlatency mode as follows:
- introduce a task dynamic priority: the task's weight multiplied by
  the average number of voluntary context switches
- use the dynamic priority to determine the task's vruntime (instead
  of the static task weight)
- evaluate the task's minimum vruntime as a function of the dynamic
  priority (tasks with a higher dynamic priority can have a smaller
  vruntime compared to tasks with a lower dynamic priority); see the
  sketch below
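A sketch of the dynamic priority and its effect on vruntime
accumulation (field names are illustrative, not the actual
scx_bpfland code):

```c
/*
 * Sketch: dynamic priority = static weight * average voluntary
 * context switch rate. A higher dynamic priority makes vruntime
 * accumulate more slowly, i.e. the task is favored.
 */
static u64 task_dyn_prio(struct task_struct *p, struct task_ctx *tctx)
{
	u64 nvcsw = tctx->avg_nvcsw ? tctx->avg_nvcsw : 1;

	return p->scx.weight * nvcsw;
}

static u64 task_vtime_delta(struct task_struct *p, struct task_ctx *tctx,
			    u64 runtime)
{
	return runtime * 100 / task_dyn_prio(p, tctx);
}
```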
The dynamic priority makes it possible to maintain good system
responsiveness even without classifying tasks as "interactive" or
"regular"; therefore, in lowlatency mode only the shared DSQ is used
(the priority DSQ is disabled).
Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.
On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).
Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Update the documentation adding the new task statistics provided by
scx_rustland_core.
Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The recent change to `disable_topology`, making the arg an `Option<bool>`
instead of a `bool`, caused an issue where it incorrectly attached
arguments. Make the argument `require_equals` to fix this case.
This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.
Test plan:
Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
[possible values: true, false]
For more information, try '--help'.
```
After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
core in a different LLC.
- Randomise the core order within each LLC.
This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.
Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.
Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```
This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
Make scx_rlfifo even simpler and keep dispatching tasks even when all
CPUs are busy.
This allows better stress testing of the scx_rustland_core backend,
using both the per-CPU DSQs and the global shared DSQ.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>