Commit Graph

2384 Commits

Author SHA1 Message Date
Tejun Heo
b7b15ac4b1 scx_layered: Deprecate idle_smt layer config
24fba4ab8d ("scx_layered: Add idle smt layer configuration") added the
idle_smt layer config, but:

- It flipped the default from preferring idle cores to having no
  preference.

- It misdocumented what the flag meant. It doesn't only pick idle cores; it
  tries to pick an idle core first and, if that fails, picks any idle CPU.

A follow-up commit 637fc3f6e1 ("scx_layered: Use layer idle_smt option")
made it more confusing. If idle_smt is set, the idle core prioritizing logic
is disabled.

The first commit disables idle core prioritization by overriding idle_smtmask
with idle_cpumask when idle_smt is *clear*, and the second commit disables the
same logic by skipping the code path when the flag is *set*. That is, both
options did exactly the same thing.

Recently, 75dd81e3e6 ("scx_layered: Improve topology aware select_cpu()")
restored the function of the flag by dropping the cpumask override. However,
because only the second commit's reversed behavior remained, this made the
actual behavior the opposite of the documented one.

This flag is hopeless. History aside, the name itself is too confusing.
idle_smt - is it saying that the flag prefers an idle SMT *thread* or an idle
SMT *core*? While the name is carried over from idle_cpumask/idle_smtmask, in
that pair the meaning of the former is clear, which also makes the latter hard
to confuse.

Preferring idle cores was one of the drivers of performance gain identified
during earlier ads experiments. Let's just drop the flag to restore the
previous behavior and retry if necessary.
2024-11-30 00:19:39 -10:00
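For reference, the documented behavior (prefer a fully idle core, otherwise any
idle CPU) can be sketched in sched_ext BPF C as below. scx_bpf_pick_idle_cpu()
and SCX_PICK_IDLE_CORE are the standard kfunc and flag; the wrapper name is
made up for illustration and the sketch assumes the usual scx BPF includes
(vmlinux.h, scx/common.bpf.h). It is not the actual scx_layered code.

    static s32 pick_idle_cpu_prefer_core(const struct cpumask *allowed)
    {
            s32 cpu;

            /* try a CPU whose whole SMT core is idle first */
            cpu = scx_bpf_pick_idle_cpu(allowed, SCX_PICK_IDLE_CORE);
            if (cpu >= 0)
                    return cpu;

            /* no fully idle core; settle for any idle CPU (or -EBUSY) */
            return scx_bpf_pick_idle_cpu(allowed, 0);
    }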
Tejun Heo
14af41d0dd
Merge pull request #1012 from sched-ext/htejun/layered-updates
scx_layered: State tracking updates and layer sizing related fixes
2024-11-29 15:31:53 +00:00
Tejun Heo
c21c710e6f scx_layered: Compensate for layer usage update fluctuations
layer_usages are updated at ops.stopping(). If tasks in a layer keep
running, the usage stats may not be updated for a long time, to the point
where the reported numbers fluctuate wildly and make CPU allocation
oscillate. Compensate by adding the time already spent by the currently
running task.
2024-11-29 01:17:45 -10:00
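A minimal sketch of the compensation idea, using made-up struct and field
names rather than scx_layered's actual ones:

    struct cpu_usage {
            u64 layer_usage;        /* accumulated at ops.stopping() */
            u64 running_at;         /* when the current task went on-CPU, 0 if idle */
    };

    /* report usage including the still-running task's elapsed time */
    static u64 layer_usage_now(const struct cpu_usage *cu, u64 now)
    {
            u64 usage = cu->layer_usage;

            if (cu->running_at && now > cu->running_at)
                    usage += now - cu->running_at;

            return usage;
    }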
Tejun Heo
cc9d9c2e5d scx_layered: Kick CPU if idle after assigning it to a layer
So that a CPU doesn't sit idle while tasks are waiting on the CPU's new
layer.
2024-11-29 00:25:54 -10:00
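In sched_ext terms this maps to a conditional kick, sketched below.
scx_bpf_kick_cpu() and SCX_KICK_IDLE are the standard kfunc and flag (the kick
is a no-op if the CPU is already busy); the helper name is hypothetical.

    /* assumes the usual scx BPF includes (vmlinux.h, scx/common.bpf.h) */
    static void layer_cpu_added(s32 cpu)
    {
            /* wake the CPU only if it is sitting idle so it can pick up
             * tasks queued on its newly assigned layer */
            scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
    }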
Tejun Heo
1a63c87812 scx_layered: Add more debug visibility to calc_target_nr_cpus() 2024-11-29 00:25:15 -10:00
Tejun Heo
49795bd8f2 scx_layered: Implement per-task avg runtime tracking and use it to calculate q latency
Per-LLC layer queueing latencies were measured on each ops.running()
transition and kept as a running average. Depending on the specific task, this
average can swing wildly, and it's difficult to base scheduling decisions on
it.

Instead, track per-task average runtime and then use it to determine the q
latency as the sum of the average runtimes of the tasks on the q. While this
requires atomic ops to maintain the sum, the operations are mostly LLC-local
and not noticeable. The up-to-date information will help make better
scheduling decisions, which should more than offset whatever additional
overhead there is.

LLC_LSTAT_LAT is now only used to monitor how llc_ctx->queued_runtime is
behaving, so decay it slower. Also, don't squelch it to zero when
LLC_LSTAT_CNT is 0 so that bugs in queued_runtime maintenance are visible.
2024-11-29 00:02:04 -10:00
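A rough sketch of the scheme with hypothetical struct and field names (the
real scx_layered code differs): keep a decaying per-task average runtime and
maintain a per-LLC sum of the averages of the currently queued tasks; that sum
is the estimated queueing latency.

    struct task_avg {
            u64 avg_runtime;        /* decayed average of on-CPU runtimes */
    };

    struct llc_queue {
            u64 queued_runtime;     /* sum of avg_runtime of queued tasks */
    };

    /* at ops.stopping(): fold the latest runtime into the decaying average */
    static void task_stopped(struct task_avg *ta, u64 runtime)
    {
            ta->avg_runtime = (ta->avg_runtime * 7 + runtime) / 8;
    }

    /* at enqueue: the task contributes its average to the LLC's backlog */
    static void task_queued(struct llc_queue *lq, const struct task_avg *ta)
    {
            __sync_fetch_and_add(&lq->queued_runtime, ta->avg_runtime);
    }

    /* at ops.running(): the task leaves the queue, remove its contribution */
    static void task_started(struct llc_queue *lq, const struct task_avg *ta)
    {
            __sync_fetch_and_sub(&lq->queued_runtime, ta->avg_runtime);
    }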
Tejun Heo
99739bcd8d scx_layered: Remove load metric
The plan was to use the load metric for layer fairness, but we went with
explicit per-layer weights instead. The load metric is not used for anything
and doesn't really add much. Remove it.
2024-11-28 16:25:15 -10:00
Tejun Heo
6a31b43c1b scx_layered: Remove now unused const volatile disable_topology 2024-11-28 16:17:02 -10:00
Tejun Heo
5b57cdf3ad
Merge pull request #1008 from sched-ext/htejun/layered-updates
scx_layered: Prioritize sched userspace and fix owned execution protection
2024-11-28 19:03:38 +00:00
Tejun Heo
4a95873bb7 Apply suggestions from code review
Co-authored-by: Jake Hillion <jakehillion@meta.com>
2024-11-28 08:55:11 -10:00
Tejun Heo
3a1e67318d
Merge pull request #1002 from luigidematteis/remove-deprecated-bindgen-api-usage
scx_utils: remove use of deprecated bindgen API; require bindgen >=0.69
2024-11-28 17:42:31 +00:00
Daniel Hodges
071e9465d9
Merge pull request #1007 from hodgesds/layered-big-little-refactor
scx_layered: Fix idle selection on big/little
2024-11-28 17:34:01 +00:00
Daniel Hodges
7153a4a150 scx_layered: Fix idle selection on big/little
Fix idle selection to take the layer growth algorithm into account so that
big/little cores are selected properly when picking idle CPUs.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-28 12:03:30 -05:00
Tejun Heo
0eb9bb701f scx_layered: Fix owned execution protection in layered_stopping()
There were a couple of bugs in the owned execution protection in
layered_stopping():

- owned_usage_target_ppk is a property of the CPU's owning layer. However,
  we were incorrectly using the task's layer's value.

- A fallback CPU can belong to a layer under system saturation, and both
  empty and in-layer execution count as owned execution. However, we were
  incorrectly always setting owned_usage_target_ppk to 50% for a fallback
  CPU when the owner layer's target_ppk could be a lot higher. This could
  effectively take away ~50% CPU util from a layer which is trying to grow
  and prevent it from growing. Don't lower target_ppk for being the fallback
  CPU.

After these fixes, `stress` starting in an empty non-preempting layer can
reliably grow the layer to its weighted size while competing against
saturating preempting layers.
2024-11-28 06:37:41 -10:00
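A small sketch of the corrected threshold lookup described above, using
stand-in types instead of scx_layered's actual cpu_ctx/layer structs:

    struct layer_prot {
            u64 owned_usage_target_ppk;     /* in-layer execution target */
    };

    struct cpu_prot {
            const struct layer_prot *owner; /* layer that owns this CPU */
            bool is_fallback;               /* fallback CPU under saturation */
    };

    static u64 protection_target_ppk(const struct cpu_prot *cpuc)
    {
            /* the threshold belongs to the CPU's owning layer, not to the
             * layer of the task that happens to be stopping */
            if (!cpuc->owner)
                    return 0;

            /* don't lower the target just because this is the fallback CPU;
             * the owning layer may be trying to grow well past 50% */
            return cpuc->owner->owned_usage_target_ppk;
    }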
Tejun Heo
fd9267fe91 scx_layered: Rename two stat fields
Add _frac to util_protected and util_open as they are fractions of the total
util of the layer. While at it, swap the two fields as util_open_frac
directly affects layer sizing.
2024-11-28 05:31:31 -10:00
Luigi De Matteis
4e4fe034fc scx_utils: remove use of deprecated bindgen API; require bindgen >=0.69
Signed-off-by: Luigi De Matteis <ldematteis123@gmail.com>
2024-11-28 12:16:41 +02:00
Andrea Righi
1f4bd50e2f
Merge pull request #999 from mmz-zmm/scx_bpfland-fix
scx_bpfland: dump cache_id_map in ascending order
2024-11-28 09:06:51 +00:00
Zhao Mengmeng
08650f52dc scx_bpfland: dump cache_id_map in ascending order
Use BTreeMap to store cache_id_map so that the program can show
L2/L3 cache info in ascending order, making it easier for humans to look up.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
2024-11-28 15:06:21 +08:00
Tejun Heo
e65903ce1e scx_layered: Boost scx_layered userspace execution
scx_layered userspace is critical for guaranteeing forward progress in a
timely manner under contention. Always put its tasks in the hi fallback DSQ.
Also, as we're
now boosting hi fallback above preempt, drop the preempt check before
boosting.
2024-11-27 20:19:27 -10:00
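A sketch of how the boost can be keyed, with a hypothetical layered_root_tgid
variable standing in for however scx_layered actually identifies its own
process; tasks matching it would be routed to the hi fallback DSQ at enqueue
time.

    /* assumes the usual scx BPF includes (vmlinux.h, scx/common.bpf.h) */
    const volatile s32 layered_root_tgid;   /* set by userspace at load time */

    /* true if @p belongs to the scx_layered userspace process itself */
    static bool is_scheduler_task(const struct task_struct *p)
    {
            return p->tgid == layered_root_tgid;
    }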
Tejun Heo
97b29d362c
Merge pull request #997 from sched-ext/htejun/layered-updates-more
scx_layered: Reimplement layered_dispatch()
2024-11-28 05:41:28 +00:00
Tejun Heo
e9e67e8ce2 Merge branch 'main' into htejun/layered-updates-more 2024-11-27 16:40:11 -10:00
Tejun Heo
0b237c638e scx_layered: Reimplement layered_dispatch()
Having two separate implementations for the topo and no-topo cases makes the
code difficult to modify and maintain. Reimplement so that:

- Layer and LLC orders are determined by userspace and the BPF code iterates
  over them. The iterations are performed using two helpers. The new
  implementation's overhead is lower for both topo and no-topo paths.

- In-layer execution protection is always enforced.

- Hi fallback is prioritized over preempting layers to avoid starving
  kthreads. This ends up also prioritizing tasks w/ custom affinities. Once
  lo fallback starvation avoidance is implemented, those will be pushed
  there.

- Fallback CPU prioritizes empty over preempt layers to guarantee that empty
  layers can quickly grow under saturation. Empty layers are set by
  userspace, and there can be a race where the layer executing scx_layered
  itself becomes empty, the layer's cpumasks are updated, and then
  scx_layered gets pushed off CPU before the empty-layer state is updated,
  which can stall the scx_layered binary. This will be solved by treating
  the scx_layered process specially.
2024-11-27 16:21:00 -10:00
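A very rough sketch of the dispatch structure this describes, with
hypothetical helpers, map layout, and DSQ id scheme; the point is that
userspace decides the layer and LLC orders and the BPF side just walks them
and consumes the first non-empty DSQ.

    /* assumes the usual scx BPF includes (vmlinux.h, scx/common.bpf.h) */
    #define MAX_LAYERS      16
    #define MAX_LLCS        16

    struct dispatch_order {
            u32 layer_order[MAX_LAYERS];    /* written by userspace */
            u32 llc_order[MAX_LLCS];        /* local LLC first, then by proximity */
    };

    /* hypothetical per-(layer, LLC) DSQ id scheme */
    static u64 layer_llc_dsq_id(u32 layer_id, u32 llc_id)
    {
            return ((u64)layer_id << 32) | llc_id;
    }

    static bool consume_in_order(const struct dispatch_order *ord,
                                 u32 nr_layers, u32 nr_llcs)
    {
            u32 i, j;

            bpf_for(i, 0, nr_layers) {
                    bpf_for(j, 0, nr_llcs) {
                            u64 dsq = layer_llc_dsq_id(ord->layer_order[i],
                                                       ord->llc_order[j]);

                            if (scx_bpf_consume(dsq))
                                    return true;
                    }
            }
            return false;
    }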
likewhatevs
09445b1a49
Merge pull request #996 from sched-ext/htejun/layered-updates
scx_layered: Implement in-layer execution protection to replace cost based fairness
2024-11-27 18:59:32 -05:00
Tejun Heo
bec5265e7b scx_layered: Build LLC proximity map
This will be used to simplify dispatch path.
2024-11-27 13:47:55 -10:00
Tejun Heo
eb817ca409 scx_layered: Move cpu_ctx functions into Scheduler for consistency
No functional changes.
2024-11-27 12:41:16 -10:00
Changwoo Min
30250e8d85
Merge pull request #990 from multics69/lavd-lhp-fairness
scx_lavd: Limit the slice extension of a lock holder
2024-11-27 22:11:25 +00:00
Daniel Hodges
d0591cbffe
Merge pull request #995 from hodgesds/freq-tracd
scripts: Add bpftrace script to trace CPU frequency
2024-11-27 21:12:04 +00:00
Tejun Heo
b18e14e220 Merge branch 'main' into htejun/layered-updates 2024-11-27 11:10:13 -10:00
Tejun Heo
f30950c2b8 scx_layered: Drop cost based fairness code
Now that layers are allocated CPUs according to their weights under system
saturation and in-layer execution can be protected against preemption, layer
weights can be enforced solely through CPU allocation. While there are still
a couple of missing pieces - the dispatch ordering issue and fallback DSQ
protection - the framework is in place. Drop the now superfluous cost based
fairness mechanism.
2024-11-27 10:42:48 -10:00
Tejun Heo
1491d5c1f8 scx_layered: Empty layers should consider owned+open when calculating target nr_cpus
The previous commit made empty layers executing on the fallback CPU count as
owned execution instead of open so that they can be protected from preemption.
This broke the target number of CPUs calculation for empty layers, as it was
only looking at open execution time. Update calc_target_nr_cpus() so that it
considers both owned and open execution time for empty layers.
2024-11-27 10:23:24 -10:00
Tejun Heo
205e5b4e29 scx_layered: Implement in-layer execution protection from preemption
Currently, a preempting layer can completely starve out non-preempting ones
regardless of weight or other configurations. Implement protection from
preemption where each CPU tries to protect in-layer execution beyond the
high util range and up to full utilization under saturation. The fallback CPU
is also protected to run empty layers up to 50% to guarantee that empty
layers can easily start growing.
2024-11-27 09:52:54 -10:00
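A plain-C sketch of the protection check itself; the field names and the
assumption that *_ppk is a parts-per-1024 scale are mine, not scx_layered's
exact code. A CPU stays protected from preemption while its owned execution is
still below the owning layer's target share.

    #define PPK_SCALE 1024          /* assumed parts-per-1024 scale for *_ppk */

    struct cpu_exec_stats {
            u64 owned_usage;        /* recent in-layer + empty-layer execution */
            u64 total_usage;        /* all recent execution on this CPU */
    };

    /* true while the owning layer's share of this CPU is below its target */
    static bool cpu_protected(const struct cpu_exec_stats *c, u64 target_ppk)
    {
            if (!c->total_usage)
                    return false;

            return c->owned_usage * PPK_SCALE < c->total_usage * target_ppk;
    }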
Daniel Hodges
fb287865f4 scripts: Add bpftrace script to trace CPU frequency
Add a bpftrace script for tracing frequency changes.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-27 11:39:26 -08:00
likewhatevs
520116fe44
Merge pull request #992 from likewhatevs/fix-pages-build
fix document generation
2024-11-27 15:00:17 +00:00
Andrea Righi
37f4e72d55
Merge pull request #993 from sched-ext/scheds-use-llc-id
scheds: use llc id
2024-11-27 08:36:31 +00:00
Andrea Righi
ae45131849 scx_flash: rely on llc_id instead of l3_id
Commit d971196 ("scx_utils: Rename hw_id and add sequential llc id")
makes llc id unique across NUMA nodes, so rely on this value to build
the LLC scheduling domain.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-27 08:25:09 +01:00
Andrea Righi
ce84f42158 scx_bpfland: rely on llc_id instead of l3_id
Commit d971196 ("scx_utils: Rename hw_id and add sequential llc id")
makes llc id unique across NUMA nodes, so rely on this value to build
the LLC scheduling domain.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-27 08:20:07 +01:00
Pat Somaru
f70aac1a5f
fix document generation 2024-11-27 00:47:21 -05:00
Changwoo Min
0213af5e17 scx_lavd: Limit the slice extension for a lock holder
The scheduler extends a lock holder's time slice at ops.dispatch()
to avoid preempting the lock holder and thereby slowing down system-wide
progress. However, this opens the possibility that the slice extension
is abused by a lock holder. To mitigate the problem, check whether a task's
time slice has been extended (lock_holder_xted) but the task is no longer a
lock holder. That means the task's time slice was extended and it has since
released the lock. In this case, give up the rest of the task's extended
time slice.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-27 10:52:04 +09:00
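A sketch of the check in BPF C; lock_holder_xted comes straight from the
commit message, while the struct layout, helper name, and call site are
assumptions about scx_lavd's code rather than the actual implementation.

    /* assumes the usual scx BPF includes (vmlinux.h, scx/common.bpf.h) */

    /* minimal stand-in for scx_lavd's per-task context */
    struct task_ctx_sketch {
            bool lock_holder_xted;  /* slice was extended as a lock holder */
    };

    static void maybe_shrink_extension(struct task_struct *p,
                                       struct task_ctx_sketch *taskc,
                                       bool still_holds_lock)
    {
            /* the slice was extended for lock holding but the lock has since
             * been released: give up the remaining extension */
            if (taskc->lock_holder_xted && !still_holds_lock) {
                    p->scx.slice = 0;
                    taskc->lock_holder_xted = false;
            }
    }

Zeroing p->scx.slice makes the task yield the CPU at the next scheduling
opportunity, which is what "give up the rest of the extended time slice" maps
to in sched_ext terms.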
Daniel Hodges
c9f8ffa66f
Merge pull request #982 from hodgesds/topo-fixes
scx_utils: Add core_id/llc_id to topology as a unique identifiers
2024-11-27 01:38:25 +00:00
Daniel Hodges
d971196595 scx_utils: Rename hw_id and add sequential llc id
Make llc_id a monotonically increasing unique value and rename hw_id to
kernel_id for topology structs.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-26 17:32:09 -08:00
Tejun Heo
641d6357bb
Merge pull request #985 from abrehman94/main
Introduce for_each_possible_cpu() for_each_online_cpu() iterators
2024-11-26 22:02:08 +00:00
Daniel Hodges
f3bbbcfaf8 scx_layered: Update core_id to be unique
On systems with multiple NUMA nodes, core_ids can be reused. Create a
hw_id that is monotonically increasing and can be used to uniquely
identify CPU cores.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-26 12:48:34 -08:00
Abdul Rehman
0936414706 Introduce for_each_possible_cpu() for_each_online_cpu() iterators
fixes: #966
2024-11-26 14:52:21 -05:00
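The iterators themselves are not spelled out here; below is a sketch of what
such macros can look like in scx BPF code (the actual definitions in the tree
may differ). scx_bpf_nr_cpu_ids() and bpf_cpumask_test_cpu() are real kfuncs
and bpf_for() is the usual open-coded iterator macro.

    /* assumes the usual scx BPF includes (vmlinux.h, scx/common.bpf.h) */
    #define for_each_possible_cpu(cpu)                              \
            bpf_for((cpu), 0, scx_bpf_nr_cpu_ids())

    #define for_each_online_cpu(cpu, online_mask)                   \
            bpf_for((cpu), 0, scx_bpf_nr_cpu_ids())                 \
                    if (bpf_cpumask_test_cpu((cpu), (online_mask)))

In this sketch, online_mask would come from scx_bpf_get_online_cpumask() and
be released with scx_bpf_put_cpumask() once the loop is done.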
Changwoo Min
c15d9ead01
Merge pull request #984 from multics69/lavd-cleanup-v2
scx_lavd: Minor code clean up
2024-11-26 10:27:20 +09:00
Changwoo Min
801111adb1
Merge branch 'sched-ext:main' into lavd-cleanup-v2 2024-11-26 10:25:43 +09:00
Tejun Heo
10339d35a4
Merge pull request #983 from CachyOS/fixes/loader-scx-stalls
scx_loader: restart scheduler upon fail
2024-11-25 22:23:36 +00:00
Daniel Hodges
5726eee03a
Merge pull request #965 from hodgesds/layered-growth-refactor
scx_layered: Refactor layer growth order
2024-11-25 22:01:11 +00:00
Vladislav Nepogodin
944510d1fc
scx_loader: restart scheduler upon fail
fixes https://github.com/sched-ext/scx/issues/937
2024-11-26 01:43:29 +04:00
Daniel Hodges
f2c963c065
Merge branch 'main' into layered-growth-refactor 2024-11-25 14:38:14 -05:00
Tejun Heo
c7b87eb3c3
Merge pull request #980 from sched-ext/htejun/layered-updates
scx_layered: Track owned/open execution times and per-LLC-layer stats
2024-11-25 19:09:15 +00:00