JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-11-28 04:10:23 +00:00

Author	SHA1	Message	Date
Dan Schatzberg	7f9548eb34	scx_layered: Add support for OpenMetrics format Currently scx_layered outputs statistics periodically as info! logs. The format of this is largely unstructured and mostly suitable for running scx_layered interactively (e.g. observing its behavior on the command line or via logs after the fact). In order to run scx_layered at larger scale, it's desirable to have statistics output in some format that is amenable to being ingested into monitoring databases (e.g. Prometheus). This allows collection of stats across many machines. This commit adds a command line flag (-o) that outputs statistics to stdout in OpenMetrics format instead of the normal log mechanism. OpenMetrics has a public format specification (https://github.com/OpenObservability/OpenMetrics) and is in use by many projects. The library for producing OpenMetrics metrics is lightweight but does induce some changes. Primarily, metrics need to be pre-registered (see OpenMetricsStats::new()). Without -o, the output looks as before, for example: ``` 19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26 19:39:54 [INFO] Layered Scheduler Attached 19:39:56 [INFO] tot= 9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms 19:39:56 [INFO] busy= 1.3 util= 65.2 load= 263.4 fallback_cpu= 1 19:39:56 [INFO] batch : util/frac= 49.7/ 76.3 load/frac= 252.0: 95.7 tasks= 458 19:39:56 [INFO] tot= 2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:56 [INFO] cpus= 2 [ 0, 2] 04000001 00000000 19:39:56 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0 19:39:56 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff 19:39:56 [INFO] normal : util/frac= 15.4/ 23.7 load/frac= 11.4: 4.3 tasks= 556 19:39:56 [INFO] tot= 7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69 19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff 19:39:58 [INFO] tot= 7091 local=84.91 open_idle= 0.00 
affn_viol= 2.64 tctx_err=0 proc=21ms 19:39:58 [INFO] busy= 0.6 util= 31.2 load= 107.1 fallback_cpu= 1 19:39:58 [INFO] batch : util/frac= 18.3/ 58.5 load/frac= 93.9: 87.7 tasks= 589 19:39:58 [INFO] tot= 2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:58 [INFO] cpus= 2 [ 2, 2] 04000001 00000000 19:39:58 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0 19:39:58 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff 19:39:58 [INFO] normal : util/frac= 13.0/ 41.5 load/frac= 13.2: 12.3 tasks= 650 19:39:58 [INFO] tot= 5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68 19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff ^C19:39:59 [INFO] EXIT: BPF scheduler unregistered ``` With -o passed, the output is in OpenMetrics format: ``` 19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26 19:40:08 [INFO] Layered Scheduler Attached # HELP total Total scheduling events in the period. # TYPE total gauge total 8489 # HELP local % that got scheduled directly into an idle CPU. # TYPE local gauge local 86.45305689716104 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs. # TYPE open_idle gauge open_idle 0.0 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions. # TYPE affn_viol gauge affn_viol 2.332430203793144 # HELP tctx_err Failures to free task contexts. # TYPE tctx_err gauge tctx_err 0 # HELP proc_ms CPU time this binary has consumed during the period. # TYPE proc_ms gauge proc_ms 20 # HELP busy CPU busy % (100% means all CPUs were fully occupied). # TYPE busy gauge busy 0.5294061026085283 # HELP util CPU utilization % (100% means one CPU was fully occupied). # TYPE util gauge util 27.37195512782239 # HELP load Sum of weight * duty_cycle for all tasks. # TYPE load gauge load 81.55024768702126 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied). 
# TYPE layer_util gauge layer_util{layer_name="immediate"} 0.0 layer_util{layer_name="normal"} 19.340849995024997 layer_util{layer_name="batch"} 8.031105132797393 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer. # TYPE layer_util_frac gauge layer_util_frac{layer_name="batch"} 29.34063385422595 layer_util_frac{layer_name="immediate"} 0.0 layer_util_frac{layer_name="normal"} 70.65936614577405 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer. # TYPE layer_load gauge layer_load{layer_name="immediate"} 0.0 layer_load{layer_name="normal"} 11.14363313258934 layer_load{layer_name="batch"} 70.40661455443191 # HELP layer_load_frac Fraction of total load consumed by the layer. # TYPE layer_load_frac gauge layer_load_frac{layer_name="normal"} 13.664744680306903 layer_load_frac{layer_name="immediate"} 0.0 layer_load_frac{layer_name="batch"} 86.33525531969309 # HELP layer_tasks Number of tasks in the layer. # TYPE layer_tasks gauge layer_tasks{layer_name="immediate"} 0 layer_tasks{layer_name="normal"} 490 layer_tasks{layer_name="batch"} 343 # HELP layer_total Number of scheduling events in the layer. # TYPE layer_total gauge layer_total{layer_name="normal"} 6711 layer_total{layer_name="batch"} 1778 layer_total{layer_name="immediate"} 0 # HELP layer_local % of scheduling events directly into an idle CPU. # TYPE layer_local gauge layer_local{layer_name="batch"} 69.79752530933632 layer_local{layer_name="immediate"} 0.0 layer_local{layer_name="normal"} 90.86574281031143 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers. # TYPE layer_open_idle gauge layer_open_idle{layer_name="immediate"} 0.0 layer_open_idle{layer_name="batch"} 0.0 layer_open_idle{layer_name="normal"} 0.0 # HELP layer_preempt % of scheduling events that preempted other tasks. 
# # TYPE layer_preempt gauge layer_preempt{layer_name="normal"} 0.0 layer_preempt{layer_name="batch"} 0.0 layer_preempt{layer_name="immediate"} 0.0 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions. # TYPE layer_affn_viol gauge layer_affn_viol{layer_name="normal"} 2.950379973178364 layer_affn_viol{layer_name="batch"} 0.0 layer_affn_viol{layer_name="immediate"} 0.0 # HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer. # TYPE layer_cur_nr_cpus gauge layer_cur_nr_cpus{layer_name="normal"} 50 layer_cur_nr_cpus{layer_name="batch"} 2 layer_cur_nr_cpus{layer_name="immediate"} 50 # HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer. # TYPE layer_min_nr_cpus gauge layer_min_nr_cpus{layer_name="normal"} 0 layer_min_nr_cpus{layer_name="batch"} 0 layer_min_nr_cpus{layer_name="immediate"} 0 # HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer. # TYPE layer_max_nr_cpus gauge layer_max_nr_cpus{layer_name="immediate"} 50 layer_max_nr_cpus{layer_name="normal"} 50 layer_max_nr_cpus{layer_name="batch"} 2 # EOF ^C19:40:11 [INFO] EXIT: BPF scheduler unregistered ``` Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-25 09:59:49 -08:00
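The `# HELP` / `# TYPE` / sample layout shown in the scx_layered commit above can be approximated by hand. The following is a minimal sketch only, not the library the commit relies on: `Sample` and `render_gauge` are hypothetical names introduced for illustration, and floats are printed with Rust's default debug formatting.

```rust
use std::fmt::Write;

/// One gauge sample (hypothetical type). `label` is an optional
/// (label name, label value) pair such as ("layer_name", "batch").
struct Sample<'a> {
    label: Option<(&'a str, &'a str)>,
    value: f64,
}

/// Append one gauge family to `out` in OpenMetrics text form:
/// a HELP line, a TYPE line, then one line per sample.
fn render_gauge(out: &mut String, name: &str, help: &str, samples: &[Sample<'_>]) {
    // Writing into a String cannot fail, so unwrap() is safe here.
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} gauge").unwrap();
    for s in samples {
        match s.label {
            Some((k, v)) => writeln!(out, "{name}{{{k}=\"{v}\"}} {:?}", s.value).unwrap(),
            None => writeln!(out, "{name} {:?}", s.value).unwrap(),
        }
    }
}
```

A complete exposition would call `render_gauge` once per metric family and finish with a `# EOF` line, as in the sample output quoted in the commit message.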
Jordan Rome	9f9a97a97f	Update descriptions in cargo toml files	2024-01-19 18:19:46 -08:00
Andrea Righi	24ef0f6c00	Merge pull request #94 from sched-ext/scx-rustland-smt-improvements scx-rustland: SMT improvements	2024-01-17 21:01:26 +01:00
Andrea Righi	be1cb8774b	scx_rustland: improve SMT performance The user-space scheduler dispatches tasks in batches, with the batch size matching the number of idle CPUs. Commit `791bdbe` ("scx_rustland: introduce SMT support") changed the order of idle CPUs, prioritizing dispatching tasks on the least busy cores (those with the most idle CPUs) before moving on to busier cores (those with the fewest idle CPUs). While this approach works well for a small number of tasks, it can lead to uneven performance as the number of tasks increases and all cores are saturated. Such uneven performance can be attributed to SMT interactions causing potential short lags and erratic system performance. In some cases, disabling SMT entirely results in better system responsiveness. To address this issue, instruct the scheduler to implicitly disable SMT and consistently dispatch tasks only on the first (or last) CPU of each core. This approach ensures an equal distribution of tasks among the available cores, preventing SMT disturbances and aligning with non-SMT performance, even when a significant number of tasks are running. Additionally, the unused sibling CPUs within each core can be used as "spare" CPUs for the BPF dispatcher. This is particularly beneficial for tasks that cannot be dispatched on the target CPU selected by the scheduler, due to cpumask restrictions or congestion conditions. Therefore, this new approach enhances system responsiveness on SMT systems, while simultaneously improving scheduler stability. Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled): running my usual benchmark of measuring the fps of a videogame (Counter-Strike 2) during a parallel kernel build-induced system overload, shows an improvement of approximately 2x (from 8-10fps to 15-25fps vs 1-2fps with EEVDF). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
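The "first CPU of each core" policy described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's actual code: the core map is modelled as a plain `BTreeMap` from core ID to sibling CPU IDs, and `dispatch_cpus` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

/// Return the idle CPUs that are eligible dispatch targets when only the
/// first sibling of each core is used. `core_map` maps a core ID to the
/// CPU IDs of its SMT siblings (an assumed representation).
fn dispatch_cpus(core_map: &BTreeMap<usize, Vec<usize>>, idle: &[usize]) -> Vec<usize> {
    core_map
        .values()
        // Keep only the first sibling of every core.
        .filter_map(|cpus| cpus.first().copied())
        // Of those, keep only the ones that are currently idle.
        .filter(|cpu| idle.contains(cpu))
        .collect()
}
```

The remaining siblings never appear in the result, which leaves them free to act as the "spare" CPUs mentioned in the commit message.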
Andrea Righi	f0c33320ab	scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle() Prior to commit `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ"), the user-space scheduler was dispatched using SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from update_idle() to ensure that at least one CPU was available to run the user-space scheduler. Now that we are using SCX_DSQ_LOCAL_ON\|cpu to dispatch the user-space scheduler, the target CPU is implicitly kicked. Therefore, the call to scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can get rid of it. Fixes: `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Tejun Heo	9089cc09bb	Merge pull request #92 from sched-ext/nest_callbacks scx_nest: Set timer callback after cancelling	2024-01-17 09:27:22 -10:00
David Vernet	7a3fe759f2	scx_nest: Remove -D option for eager compaction Now that scheduling BPF timers works correctly, we don't need this extra logic to eagerly compact if a scheduling for compaction has happened a few times in a row. Let's remove it. Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:08:36 -06:00
David Vernet	607119d8a4	scx_nest: Set timer callback after cancelling In scx_nest, we use a per-cpu BPF timer to schedule compaction for a primary core before it goes idle. If a task comes along that could use that core, we cancel the callback with bpf_timer_cancel(). bpf_timer_cancel() drops a refcnt on the prog and nullifies the callback, so if we want to schedule the callback again, we must use bpf_timer_set_callback() to reset the prog. This patch does that. Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:01:39 -06:00
Andrea Righi	0b3c399519	scx_rustland: introduce dynamic slice boost Update the slice boost dynamically, as a function of the amount of CPUs in the system and the amount of tasks currently waiting to be dispatched: as the amount of waiting tasks in the task_pool increases, reduce the slice boost. This adjustment ensures that the scheduler adheres more closely to a pure vruntime-based policy as the amount of tasks contending the available CPUs increases and it allows to sustain stress tests that are spawning a massive amount of tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:51:51 +01:00
Andrea Righi	791bdbec97	scx_rustland: introduce SMT support Introduce basic support for CPU topology awareness. With this change, the scheduler will prioritize dispatching tasks to idle CPUs with fewer busy SMT siblings, then, it will proceed to CPUs with more busy SMT siblings, in ascending order. To implement this, introduce a new CoreMapping abstraction, that provides a mapping of the available core IDs in the system along with their corresponding lists of CPU IDs. This, coupled with the get_cpu_pid() method from the BpfScheduler abstraction, allows the user-space scheduler to enforce the policy outlined above and improve performance on SMT systems. Keep in mind that this improvement is relevant only when the amount of tasks running in the system is less than the amount of CPUs. As soon as the amount of running tasks increases, they will be distributed across all available CPUs and cores, thereby negating the advantages of SMT isolation. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:33:35 +01:00
Andrea Righi	63209b865d	scx_rustland: support aligned allocations in RustLandAllocator Even if the current implementation of the user-space scheduler doesn't require to allocate aligned memory, add a simple support to aligned allocations in RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-15 13:44:33 +01:00
Andrea Righi	c593e3605e	scx_rustland: report user-space scheduler page fault counter Periodically report a page fault counter in the scheduler output. The user-space scheduler should never trigger page faults, otherwise we may experience deadlocks (that would trigger the sched-ext watchdog, unloading the scheduler). Reporting a page fault counter periodically to stdout can be really helpful to debug potential issues with the custom allocator. Moreover, group together also nr_sched_congested and nr_failed_dispatches with nr_page_faults and use the sum of all these counters to determine the healthy status of the user-space scheduler (reporting it to stdout as well). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
Andrea Righi	9708a80130	scx_userland: use a custom memory allocator to prevent page faults To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults. To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer. This, coupled with the memory locking (via mlockall), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition. This memory allocator is completely transparent to the user-space scheduler code and it is applied automatically when the bpf module is imported. In the future we may decide to move this allocator to a more generic place (scx_utils crate), so that also other user-space Rust schedulers can use it. This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free). This allocator can be improved in the future, but right now, despite its simplicity, it shows a reasonable speed and efficiency in meeting memory requests from the user-space scheduler, having to deal mostly with small and uniformly sized allocations. With this change in place scx_rustland survived more than 10hrs on a heavily stressed system (with stress-ng and kernel builds running in a loop): $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland` PID RSS ELAPSED CMD 34966 75840 10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland Without this change it is possible to trigger the sched-ext watchdog timeout in less than 5min, under the same system load conditions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
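The commit above describes a block-based allocator that tracks each block's status in an array over a pre-allocated buffer. A minimal sketch of that idea follows. It is not the RustLandAllocator itself: the names, the block size, and the block count are assumptions, and it hands out byte offsets instead of implementing `GlobalAlloc`.

```rust
const BLOCK_SIZE: usize = 64; // assumed block size in bytes
const NUM_BLOCKS: usize = 16; // assumed number of blocks

/// Hypothetical fixed-size block allocator over a pre-allocated buffer.
struct BlockAllocator {
    buf: [u8; BLOCK_SIZE * NUM_BLOCKS],
    used: [bool; NUM_BLOCKS], // true = block allocated, false = free
}

impl BlockAllocator {
    fn new() -> Self {
        Self { buf: [0; BLOCK_SIZE * NUM_BLOCKS], used: [false; NUM_BLOCKS] }
    }

    /// Reserve a contiguous run of free blocks large enough for `size`
    /// bytes and return its byte offset into the buffer, if any.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let need = size.div_ceil(BLOCK_SIZE).max(1);
        let mut run = 0; // length of the current run of free blocks
        for i in 0..NUM_BLOCKS {
            run = if self.used[i] { 0 } else { run + 1 };
            if run == need {
                let start = i + 1 - need;
                self.used[start..=i].fill(true);
                return Some(start * BLOCK_SIZE);
            }
        }
        None
    }

    /// Release the blocks previously returned by `alloc(size)` at `offset`.
    fn free(&mut self, offset: usize, size: usize) {
        let start = offset / BLOCK_SIZE;
        let need = size.div_ceil(BLOCK_SIZE).max(1);
        self.used[start..start + need].fill(false);
    }

    /// Borrow the bytes backing an allocation.
    fn bytes_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.buf[offset..offset + size]
    }
}
```

Because the buffer lives inside the struct and never grows, serving a request only touches memory that already exists, which is the property the commit relies on (together with mlockall) to avoid page faults.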
Andrea Righi	acc1d51560	scx_rustland: remove obsolete TODO note Entries from TaskInfoMap associated to exiting tasks are already removed via the BPF .exit_task() callback, so drop the obsolete TODO note and replace it with a proper comment. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 20:47:36 +01:00
Andrea Righi	12d89e1d84	scx_rustland: add a troubleshooting section Add a brief troubleshooting section to the command line help. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:46 +01:00
Andrea Righi	2157f638df	scx_rustland: voluntary context switch boost Improve priority boosting using the voluntary context switches metric. Overview ======== The current criterion to apply the time slice boost (option `-b`) is to distinguish between newly created tasks and tasks that are already running: in order to prioritize interactive applications (games, multimedia, etc.) we apply a time slice usage penalty on newly created tasks, indirectly boosting the priority of tasks that are already running, which are likely to be the interactive applications that we aim to prioritize. Problem ======= This approach works well when the background workload forks a bunch of short-lived tasks (e.g., a parallel kernel build), but it fails to properly classify CPU-intensive background tasks (e.g., video/3D rendering, encryption, large data analysis, etc.), because these applications, typically, do not generate many short-lived processes. In the presence of such workloads the time slice penalty is not enforced, resulting in a lack of any boost for interactive applications. Solution ======== A more effective criterion for distinguishing between interactive applications and background CPU-intensive applications is to examine the voluntary context switches: an application that periodically releases the CPU voluntarily is very likely to be interactive. Therefore, change the time slice boost logic to apply a bonus (scale down the accounted used time slice) to tasks that show an increase in their voluntary context switches counter over a time frame of 10 sec. Based on experimental results, this simple heuristic appears to be quite effective in classifying interactive tasks and prioritizing them over potential background CPU-intensive tasks. Additionally, having a better criterion to identify interactive tasks allows newly created tasks to be prioritized as well, thereby enhancing the responsiveness of interactive shell sessions. 
This always ensures the prompt execution of system commands, even when the system is massively overloaded, unlike the previous time slice boost logic, which made interactive shell sessions less responsive by deprioritizing newly created tasks. Results ======= With this new logic in place it is possible to play a video game (e.g., Terraria) without experiencing any frame rate drop (60 fps), while a parallel CPU stress test (`stress-ng -c 32`) is running in the background. The same result can also be obtained with a parallel kernel build (`make -j 32`). Thus, there is no regression compared to the previous "ideal" test case. Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`), Terraria can still be played without noticeable lag in the audio or video, maintaining a consistent 60 fps. In addition to that, shell commands are also very responsive. Following, the results (average and standard deviation of 10 runs) of two simple interactive shell commands, while both the `make -j 16` and `stress-ng -c 16` workloads are running in background: avg time "uname -r" "ps axuw > /dev/null" ========================================================= EEVDF 11.1ms 231.8ms scx_rustland 2.6ms 212.0ms stdev "uname -r" "ps axuw > /dev/null" ========================================================= EEVDF 2.28 23.41 scx_rustland 0.70 9.11 Tests conducted on a 8-cores laptop (11th Gen Intel i7-1195G7 @ 4.800GHz) with 16GB of RAM. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:30 +01:00
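The heuristic in the commit above (apply a bonus to tasks whose voluntary context switch counter grew over a 10 second window) might be sketched like this. Everything beyond that one sentence is an assumption made for illustration: the `TaskInfo` fields, the `account_slice` helper, and the choice to divide the used time slice by the boost value.

```rust
const WINDOW_NS: u64 = 10_000_000_000; // 10 seconds, as stated in the commit

/// Hypothetical per-task bookkeeping.
struct TaskInfo {
    nvcsw: u64,           // counter value at the start of the window
    window_start_ns: u64, // timestamp at which the window started
    interactive: bool,    // classification from the last completed window
}

/// Return the time slice to account to the task. Tasks classified as
/// interactive get their used time scaled down by `boost` (assumed form
/// of the bonus); other tasks are charged the full amount.
fn account_slice(info: &mut TaskInfo, nvcsw_now: u64, now_ns: u64, used_ns: u64, boost: u64) -> u64 {
    // Re-evaluate the classification once per window.
    if now_ns.saturating_sub(info.window_start_ns) >= WINDOW_NS {
        info.interactive = nvcsw_now > info.nvcsw;
        info.nvcsw = nvcsw_now;
        info.window_start_ns = now_ns;
    }
    if info.interactive {
        used_ns / boost.max(1)
    } else {
        used_ns
    }
}
```

A smaller accounted slice means a smaller vruntime increase, so a task that keeps yielding the CPU voluntarily is picked sooner than a CPU-bound task that never does.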
Andrea Righi	1cf03770c7	scx_rustland: expose voluntary context switches to the scheduler Provide the number of voluntary context switches (nvcsw) for each task to the user-space scheduler. This extra information can then be used by the scheduler to enhance its decision-making process when scheduling tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 14:10:39 +01:00
Tejun Heo	1395f14975	Update README.md Embed the video and drop "live" from section title as it's not really live.	2024-01-10 14:47:33 -10:00
Tejun Heo	18f7fe8477	scx_flatcg: Fix fallout from direct dispatch API update `552b75a9c7` ("scx: Build fix after kernel update") updated scx_flatcg along with other schedulers to use the new direct dispatching from ops.select_cpu() mechanism. However, this was buggy for flatcg. flatcg uses direct dispatch for two purposes - as an optimization when there are idle cpus and to avoid dealing with custom CPU affinities in the dispatch logic. While the former can be moved to ops.select_cpu(), the latter can't as it should also apply to tasks which get enqueued without preceding ops.select_cpu(), e.g., when the task gets requeued after an attribute change or runs out of time slice. The API update incorrectly moved both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup() and scheduling misbehaviors. Fix it by separating out the two cases and only keeping the idle optimization case in ops.select_cpu(). Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:57:50 -10:00
Tejun Heo	c1f22ea073	scx_flatcg: Report pick_next_cgroup() race and fail counts To improve visibility into failure mode. While at it, improve output formatting. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:52:24 -10:00
Tejun Heo	ae50b155ca	Merge pull request #80 from sched-ext/scx-flatcg-mitigate-stall scx_flatcg: introduce CGROUP_MAX_RETRIES	2024-01-10 09:49:09 -10:00
Andrea Righi	0609abdca6	scx_flatcg: introduce CGROUP_MAX_RETRIES We may end up stalling for too long in fcg_dispatch() if try_pick_next_cgroup() doesn't find another valid cgroup to pick. This can be quite risky, considering that we are holding the rq lock in dispatch(). This condition can be reproduced easily in our CI, where we can trigger stalling softirq works: [ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!! Or rcu stalls: [ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0 [ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0 [ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4) [ 47.731900] Sending NMI from CPU 0 to CPUs 1: To mitigate this issue reduce the amount of try_pick_next_cgroup() retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:36:17 +01:00
Andrea Righi	0198d893ce	scx_rustland: introduce time slice boost parameter Introduce a parameter to prioritize active running tasks over newly created tasks. This option can be used to enhance interactive applications (e.g., games, audio/video, GUIs, etc.) that are concurrently running with fork-intensive background workloads (such as a large parallel build for example). The boost value (which functions as a penalty) is applied to the time slice attributed to newly generated tasks, increasing their vruntime and, in an indirect manner, "boosting" the priority of all the other concurrent active tasks. The time slice boost parameter was applied in the live demo video [1] to enhance the frames per second (fps) of a video game (Terraria), running simultaneously with a parallel kernel build (`make -j 32`) on an 8-core laptop (the value used in the video matches the existing setting of running `scx_rustland -b 200`). [1] https://www.youtube.com/watch?v=oCfVbz9jvVQ Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:32:29 +01:00
Andrea Righi	732ba4900b	scx_rustland: avoid using SCX_ENQ_PREEMPT With the introduction of the dynamic time slice that scales down based on the number of tasks in the system, there is no obvious benefit in utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler. The reduced time slice as the task count increases already enhances the user-space scheduler's opportunities to run and efficiently manage scheduling tasks, even when the system is massively overloaded. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:32:29 +01:00
Andrea Righi	db9a29d618	scx_rustland: improve dynamic slice scaling Move scaling after tasks are sent to the dispatcher: tasks are dispatched based on the amount of idle CPUs, so checking for any remaining tasks still sitting in the scheduler after dispatch gives a better idea how busy the system is. Moreover, do not scale the time slice based on nr_cpus (otherwise, systems with a large amount of CPUs would rarely get any scaling at all). Instead, apply a scaling factor as a function of how many tasks are still waiting in the scheduler: nr_scheduled / 2. This method scales better as the number of CPUs increases. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Andrea Righi	1da2983804	scx_rustland: get rid of force_local Now that we can dispatch directly from select_cpu() we can make the code more compact and readable by removing the force_local logic. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Andrea Righi	6ead675fb6	scx_rustland: add a link to the live demo in the README Update the README.md adding a link to a live demo video of the scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Tejun Heo	942b0269b8	Bump versions After updates to reflect the updated init and direct dispatch API, the schedulers aren't compatible with older kernels. Bump versions and publish releases.	2024-01-08 18:49:54 -10:00
Tejun Heo	552b75a9c7	scx: Build fix after kernel update In the latest kernel, sched_ext API has changed in two areas: - ops.prep_enable/cancel_enable/enable/disable() replaced with ops.init_task/enable/disable/exit_task(). - scx_bpf_dispatch() can now be called from ops.select_cpu(). Also, SCX_ENQ_LOCAL flag is removed. Instead, users can call scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out param value to determine whether to dispatch directly. This commit updates all schedulers so that they build. - Init functions renamed / merged / split. - ops.select_cpu() is added to several schedulers and local direct dispatching logic is moved there. This is the minimum update which is needed to make the schedulers build and work. It needs further updates to e.g. move vtime updates to ops.enable().	2024-01-08 14:48:24 -10:00
Andrea Righi	1ea5aebfb4	scx_rustland: always consider slice_ns as maximum time slice With the introduction of the dynamic time slice that scales down based on the number of tasks in the system, there is no longer any need to apply a constant scaling factor to the time slice to extend the range of the allowed time slices. Therefore, get rid of the static scaling and use slice_ns as the upper limit for the time slice accounted to the tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 19:22:38 +01:00
Andrea Righi	9b482f48f1	scx_rustland: determine the amount of cores via /proc/stat libbpf_rs::num_possible_cpus() may take into account multi-threads multi-cores information, that are not used efficiently by the scheduler at the moment. For simplicity rely on /proc/stat to determine the amount of CPUs that can be used by the scheduler and provide a proper abstraction to access this information from the bpf Rust module. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 19:11:25 +01:00
Andrea Righi	0d107d6220	scx_rustland: return the proper cpu value from get_task_cpu() Fix the ternary operator expression to return the CPU id, instead of the boolean result of the condition. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 19:10:59 +01:00
Andrea Righi	fa6915cc0a	scx_rustland: simplify update_enqueued() With the introduction of a variable time slice that scales down in function of the amount of waiting tasks, the scheduler is able to handle a steady stream of newly spawned tasks, without having to de-prioritize them to guarantee a good level of system responsiveness. Hence, the logic for de-prioritizing new tasks can be removed, as it currently doesn't provide any measurable benefits. In fact, it even proves counterproductive as it can implicitly slow down the interactive performance of shell sessions when the system is overloaded with a significant amount of CPU hogs (e.g, `stress-ng -c 128`). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 07:38:52 +01:00
Andrea Righi	bf98154ee1	scx_rustland: use dynamic time slice in the user-space scheduler Implement a simple logic in the user-space scheduler to automatically adjust the tasks' time slice: reduce the time slice by a scaling factor of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks waiting in the scheduler and nr_cpus is the amount of CPUs in the system. Using a fine-grained time slice as the number of tasks in the system grows, improves responsiveness of low-latency activities (e.g., audio, video games), also in presence of other CPU-intensive tasks that are concurrently running in the system. On the other hand, extending the time slice when only a limited number of tasks are active in the system contributes to an enhancement in the overall system throughput and a reduced amount of context switches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 07:38:52 +01:00
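The scaling rule stated in the commit above, reducing the time slice by a factor of (nr_waiting / nr_cpus + 1), works out as in the sketch below. The function name and the use of integer division are assumptions; the commit message does not say how the division is rounded.

```rust
/// Scale the default time slice down as tasks pile up in the scheduler.
/// Uses the factor (nr_waiting / nr_cpus + 1) from the commit message,
/// with integer division (an assumption).
fn effective_slice_ns(slice_ns: u64, nr_waiting: u64, nr_cpus: u64) -> u64 {
    // Guard against a zero CPU count so the division cannot panic.
    let scale = nr_waiting / nr_cpus.max(1) + 1;
    slice_ns / scale
}
```

With a 20 ms default slice on 8 CPUs, no waiting tasks leaves the slice at 20 ms, while 24 waiting tasks give a factor of 24 / 8 + 1 = 4 and a 5 ms slice.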
Andrea Righi	303c4ea548	scx_rustland: dynamic time slice support Add to BpfScheduler() the new methods set_effective_slice_us() and get_effective_slice_us(). These methods can be used by the user-space scheduler to dynamically adjust (and retrieve) the effective time slice used to dispatch tasks within the BPF dispatcher. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-08 07:35:31 +01:00
Andrea Righi	2a32d81859	scx_rustland: store default slice_ns in the scheduler class Cache slice_ns into the main scheduler class to avoid accessing it via self.bpf.skel.rodata().slice_ns every single time. This also makes the scheduler code more clear and more abstracted from the BPF details. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-07 16:14:51 +01:00
Andrea Righi	8ccbbdadee	scx_userland: improve BPF logging Always report task comm, nr_queued and nr_scheduled in the log messages. Moreover, report also task name (comm) and cpu when possible. All this extra information can be really helpful to trace and debug scheduling issues. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-07 16:14:51 +01:00
Andrea Righi	295873ac41	scx_rustland: always dispatch per-CPU kthreads from enqueue We allow tasks to bypass the user-space scheduler and be dispatched directly using a shortcut in the enqueue path, if their running CPU is immediately available or if the task is a per-CPU kthread. However, the shortcut is disabled if the user-space scheduler has some pending activities to do (to avoid disrupting too much its decision). In this case the shortcut is disabled also for per-CPU kthreads and that may cause priority-inversion problems in the system, triggering some stall of some per-CPU kthreads (such as rcuog/N) and short system lockups, if the system is overloaded. Prevent this by always enabling the dispatch shortcut for per-CPU kthreads. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Andrea Righi	0c3bdb16fe	scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue() When we fail to push a task to the queued BPF map we fallback to direct dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use SCX_DSQ_GLOBAL in this case to prevent scheduler crashes. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Andrea Righi	05d997c539	scx_rustland: more robust CPU selection logic in the dispatcher Instead of just trying the target CPU and the previously used CPU, we could cycle among all the available CPUs (if both those CPUs cannot be used), before using the global DSQ. This allows to not de-prioritize too much tasks that can't be scheduled on the CPU selected by the scheduler (or their previously used CPU), and we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Andrea Righi	18a990ae82	scx_rustland: assign min_vruntime before time slice evaluation Assign min_vruntime to the task before the weighted time slice is evaluated, then add the time slice. In this way we still ensure that the task's vruntime is in the range (min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify the effect of the evaluated time slice if the starting vruntime of the task is too small. Also change update_enqueued() to return the evaluated weighted time slice (that can be used in the future). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
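The ordering described in 18a990ae82 can be illustrated with a minimal Rust sketch. This is not the code from the commit: the `TaskInfo` struct, the parameter list, and the weight scaling by 100 are assumptions made for illustration. Only the order of operations (raise vruntime to min_vruntime first, then add the clamped weighted slice, and return that slice) follows the commit message.

```rust
/// Per-task accounting state (hypothetical struct, for illustration only).
struct TaskInfo {
    sum_exec_runtime: u64, // total runtime seen at the previous enqueue
    vruntime: u64,         // weighted virtual runtime of the task
}

/// Update the task's vruntime at enqueue time and return the evaluated
/// weighted time slice. The signature is an assumption, not the real one.
fn update_enqueued(
    task: &mut TaskInfo,
    sum_exec_runtime: u64,
    weight: u64,
    min_vruntime: u64,
    max_slice_ns: u64,
) -> u64 {
    // Runtime used since the last enqueue, scaled inversely by the weight
    // (assumed scaling: a weight of 100 leaves the runtime unchanged).
    let delta = sum_exec_runtime.saturating_sub(task.sum_exec_runtime);
    let slice = (delta * 100 / weight.max(1)).clamp(1, max_slice_ns);

    // Raise the starting vruntime to min_vruntime *before* adding the slice,
    // so the evaluated slice is not nullified when vruntime is too small.
    if task.vruntime < min_vruntime {
        task.vruntime = min_vruntime;
    }
    task.vruntime += slice;

    // Remember the runtime for the next evaluation.
    task.sum_exec_runtime = sum_exec_runtime;

    slice
}
```

With these assumptions, a task starting at vruntime 10 that ran for 5000 ns at weight 100 while min_vruntime is 1000 ends up at vruntime 6000, rather than being reset to just above min_vruntime.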
Andrea Righi	92109c95a9	scx_rustland: small TaskTree.push() refactoring Change TaskTree.push() to accept directly a Task object, rather than each individual attribute. Moreover, Task attributes don't need to be public, since both TaskTree and Task are only used locally. This makes the code more elegant and more readable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-06 11:06:53 +01:00
Jordan Rome	661ea57c5c	bump scx_rusty and scx_layered These were supposed to be bumped in this commit: `fed1dae9da`	2024-01-04 13:57:29 -08:00
Andrea Righi	96f3eb42be	Merge pull request #68 from sched-ext/scx-rustland-refactoring scx_rustland: refactoring	2024-01-04 20:42:30 +01:00
Andrea Righi	7813992896	scx_rustland: introduce nr_failed_dispatches Introduce a new counter to report the amount of failed dispatches: if the scheduler designates a target CPU for a task, and both the chosen CPU and the previously utilized one are unavailable when the task is dispatched, the task will be sent to the global DSQ, and the counter will be incremented. Also mark all the methods to access these statistics counters as optional. In the future we may also provide a "verbose" option and show these statistics only when the scheduler runs in verbose mode. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 17:36:06 +01:00
Andrea Righi	796a7ebc0e	scx_rustland: provide an abstraction layer for the BPF component Move the code responsible for interfacing with the BPF component into its own module and provide high-level abstractions for the user-space scheduler, hiding all the internal BPF implementation details. This makes the user-space scheduler code much more readable and it allows potential developers/contributors that want to focus at the pure scheduling details to modify the scheduler in a generic way, without having to worry about the internal BPF details. In the future we may even decide to provide the BPF abstraction as a separate crate, that could be used as a baseline to implement user-space schedulers in Rust. API overview ============ The main BPF interface is provided by BpfScheduler(). When this object is initialized it will take care of registering and initializing the BPF component. Then the scheduler can use the BpfScheduler() instance to receive tasks (in the form of QueuedTask object) and dispatch tasks (in the form of DispatchedTask objects), using respectively the methods dequeue_task() and dispatch_task(). The CPU ownership map can be accessed using the method get_cpu_pid(), this also allows to keep track of the idle and busy CPUs, with the corresponding PIDs associated to them. BPF counters and statistics can be accessed using the methods nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be updated to notify the BPF component if the user-space scheduler has some pending work to do or not. Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can be used respectively to read the exit code and exit message from the BPF component, when the scheduler is unregistered. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 16:49:09 +01:00
Jordan Rome	5bacefcdbe	Add README files for each rust scheduler This is because each scheduler has its own Rust crate and it's better if they had a README associated with each one. https://crates.io/crates/scx_layered	2024-01-04 07:35:44 -08:00
Andrea Righi	7c11837a61	scx_rustland: make dispatcher more robust We always try to use the current CPU (from the .dispatch() callback) to run the user-space scheduler itself and if the current CPU is not usable (according to the cpumask) we just re-use the previously used CPU. However, if the previously used CPU is also not usable, we may trigger the following error: sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201]) Potentially this can also happen with any task, so improve the dispatch logic as follows: - dispatch on the target CPU, if usable - otherwise dispatch on the previously used CPU, if usable - otherwise dispatch on the global DSQ Moreover, rename dispatch_on_cpu() -> dispatch_task() for better clarity. This should be enough to handle all the possible decisions made by the user-space scheduler, making the dispatcher more robust. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 10:21:40 +01:00
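The three-step fallback listed in 7c11837a61 can be modelled in a few lines of Rust. This is only a sketch of the decision order: the real logic lives in the scheduler's BPF component, and the `DispatchTarget` enum, the `pick_dispatch_target()` function, and the boolean slice standing in for the task's cpumask are all hypothetical names introduced here.

```rust
/// Where a task is sent after the fallback chain
/// (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum DispatchTarget {
    /// Per-CPU local DSQ of the given CPU (models SCX_DSQ_LOCAL_ON).
    LocalOn(usize),
    /// Global DSQ (models SCX_DSQ_GLOBAL).
    Global,
}

/// Choose the dispatch target: target CPU, then previous CPU, then global DSQ.
/// `allowed[cpu]` stands in for the task's cpumask (an assumption of this sketch).
fn pick_dispatch_target(target_cpu: usize, prev_cpu: usize, allowed: &[bool]) -> DispatchTarget {
    // A CPU outside the mask is treated as not usable.
    let usable = |cpu: usize| allowed.get(cpu).copied().unwrap_or(false);

    if usable(target_cpu) {
        DispatchTarget::LocalOn(target_cpu)
    } else if usable(prev_cpu) {
        DispatchTarget::LocalOn(prev_cpu)
    } else {
        DispatchTarget::Global
    }
}
```

For example, `pick_dispatch_target(1, 0, &[true, false])` falls back to the previously used CPU 0, and the same call with `&[false, false]` falls through to the global DSQ instead of triggering a "target cpu not allowed" runtime error.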
Andrea Righi	69c1dfc03c	scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check In the dispatch callback we can dispatch tasks to any CPU, according to the scheduler decisions, so there's no reason to check for the available dispatch slots in the current CPU only, to determine if we need to stop dispatching tasks. Since the scheduler is aware of the idle state of the CPUs (via the CPU ownership map) it has all the information to automatically regulate the flow of dispatched tasks and not overflow the dispatch slots, therefore it is safe to remove this check. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 09:41:54 +01:00
Andrea Righi	6b1e7d927d	scx_rustland: update comments and documentation in the BPF part No functional change, only a little polishing, including updates to comments and documentation to align with the latest changes in the code. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-04 09:40:49 +01:00

138 Commits