24fba4ab8d ("scx_layered: Add idle smt layer configuration") added the
idle_smt layer config but
- It flipped the default from preferring idle cores to not having any
  preference.
- It misdocumented what the option meant: it doesn't only pick idle cores,
  it tries to pick an idle core first and, if that fails, picks any idle
  CPU.
A follow-up commit 637fc3f6e1 ("scx_layered: Use layer idle_smt option")
made it more confusing: if idle_smt is set, the idle core prioritizing logic
is disabled.
The first commit disables idle core prioritization by overriding
idle_smtmask to be idle_cpumask when idle_smt is *clear*, and the second
commit disables the same logic by skipping the code path when the flag is
*set*. That is, both flag values ended up doing exactly the same thing.
Recently, 75dd81e3e6 ("scx_layered: Improve topology aware select_cpu()")
restored the function of the flag by dropping the cpumask override. However,
this left only the behavior of the second commit, which implemented the
reverse behavior, making the actual behavior the opposite of the documented
one.
This flag is hopeless. History aside, the name itself is too confusing:
does idle_smt mean preferring an idle SMT *thread* or an idle SMT *core*?
While the name was carried over from idle_cpumask/idle_smtmask, there the
meaning of the former is clear, which also makes it hard to misread what the
latter means. As a standalone flag name, it has no such anchor.
Preferring idle cores was one of the drivers of performance gain identified
during earlier ads experiments. Let's just drop the flag to restore the
previous behavior and retry if necessary.
layer_usages are updated at ops.stopping(). If tasks in a layer keep
running, the usage stats may not be updated for a long time, to the point
where the reported usage fluctuates wildly and makes CPU allocation
oscillate. Compensate by adding the time already spent by the currently
running task.
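A minimal sketch of the idea, with illustrative struct and field names
rather than the actual scx_layered definitions:

  #include <stdint.h>

  typedef uint64_t u64;

  /* illustrative per-CPU context, not the real cpu_ctx layout */
  struct cpu_ctx {
      u64 layer_usages[16];   /* accumulated at ops.stopping() */
      u64 running_at;         /* timestamp of the last ops.running() */
      int running_layer;      /* layer of the task currently on the CPU */
  };

  /* report usage including the slice that hasn't been folded in yet */
  static u64 layer_usage_now(const struct cpu_ctx *cctx, int layer, u64 now)
  {
      u64 usage = cctx->layer_usages[layer];

      if (cctx->running_layer == layer && now > cctx->running_at)
          usage += now - cctx->running_at;

      return usage;
  }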
Per-LLC layer queueing latencies were measured on each ops.running()
transition and folded into a running average. Depending on the specific
tasks, this average can swing wildly and it's difficult to base scheduling
decisions on it.
Instead, track each task's average runtime and determine the queue latency
as the sum of the average runtimes of the tasks on the queue. While this
requires atomic ops to maintain the sum, the operations are mostly LLC-local
and not noticeable. The up-to-date information helps make better scheduling
decisions, which should more than offset whatever additional overhead there
is.
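The following sketch shows the bookkeeping. Only llc_ctx->queued_runtime is
named in this change; the other structs, fields and the averaging factor are
assumptions for illustration:

  #include <stdint.h>

  typedef uint64_t u64;

  struct task_ctx { u64 avg_runtime; };       /* decaying per-task average */
  struct llc_ctx  { u64 queued_runtime; };    /* estimated queueing latency */

  /* task gets queued on the LLC: add its expected runtime to the sum */
  static void on_enqueue(struct llc_ctx *llcx, struct task_ctx *taskc)
  {
      __sync_fetch_and_add(&llcx->queued_runtime, taskc->avg_runtime);
  }

  /* task starts running: it's no longer waiting, drop its contribution */
  static void on_running(struct llc_ctx *llcx, struct task_ctx *taskc)
  {
      __sync_fetch_and_sub(&llcx->queued_runtime, taskc->avg_runtime);
  }

  /* task stops after running for @ran_for: update its average runtime */
  static void on_stopping(struct task_ctx *taskc, u64 ran_for)
  {
      taskc->avg_runtime = (taskc->avg_runtime * 3 + ran_for) / 4;
  }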
LLC_LSTAT_LAT is now only used to monitor how llc_ctx->queued_runtime is
behaving, so decay it more slowly. Also, don't squelch it to zero when
LLC_LSTAT_CNT is 0, so that bugs in queued_runtime maintenance remain
visible.
The plan was to use the load metric for layer fairness, but we went with
explicit per-layer weights instead. The load metric is not used for anything
and doesn't really add much. Remove it.
Fix idle selection to take the layer growth algorithm into account so that
big/little cores are properly chosen when selecting idle CPUs.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
There were a couple of bugs in owned execution protection in
layered_stopping():
- owned_usage_target_ppk is a property of the CPU's owning layer. However,
  we were incorrectly using the task's layer's value.
- A fallback CPU can belong to a layer under system saturation, and both
  empty and in-layer execution count as owned execution. However, we were
  incorrectly always setting owned_usage_target_ppk to 50% for a fallback
  CPU even when the owner layer's target_ppk could be a lot higher. This
  could effectively take away ~50% of CPU util from a layer that is trying
  to grow and prevent it from growing. Don't lower target_ppk just because
  a CPU is the fallback CPU.
After these fixes, `stress` starting in an empty non-preempting layer can
reliably grow the layer to its weighted size while competing against
saturating preempting layers.
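A sketch of the corrected lookup; layered_stopping() and
owned_usage_target_ppk come from the actual change, while the struct layout
and helper below are illustrative:

  #include <stdbool.h>
  #include <stdint.h>

  typedef uint32_t u32;

  struct layer   { u32 owned_usage_target_ppk; };
  struct cpu_ctx {
      struct layer *owner;     /* the layer this CPU belongs to */
      bool is_fallback;        /* fallback CPU under saturation */
  };

  /* protection target used when deciding whether owned execution on this
   * CPU may be preempted - always the owning layer's value, no longer
   * capped at 50% just because the CPU is the fallback CPU */
  static u32 protection_target_ppk(const struct cpu_ctx *cctx)
  {
      return cctx->owner->owned_usage_target_ppk;
  }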
Add _frac to util_protected and util_open as they are fractions of the total
util of the layer. While at it, swap the two fields as util_open_frac
directly affects layer sizing.
Use a BTreeMap to store cache_id_map so that the program can show L2/L3
cache info in ascending order, making it easier for humans to look up.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
scx_layered userspace is critical for guaranteeing forward progress in a
timely manner under contention. Always put its tasks in the hi fallback DSQ.
Also, as hi fallback is now boosted above preempt, drop the preempt check
before boosting.
Having two separate implementations for the topo and no-topo cases makes the
code difficult to modify and maintain. Reimplement so that:
- Layer and LLC orders are determined by userspace and the BPF code iterates
  over them. The iterations are performed using two helpers (see the sketch
  after this list). The new implementation's overhead is lower for both topo
  and no-topo paths.
- In-layer execution protection is always enforced.
- Hi fallback is prioritized over preempting layers to avoid starving
  kthreads. This ends up also prioritizing tasks with custom affinities.
  Once lo fallback starvation avoidance is implemented, those will be pushed
  there.
- The fallback CPU prioritizes empty layers over preempting ones to
  guarantee that empty layers can quickly grow under saturation. Empty
  layers are determined by userspace, so there is a race: the layer running
  scx_layered itself can become empty, its cpumasks can be updated, and
  scx_layered can then get pushed off the CPU before the empty layers are
  updated, which can stall the scx_layered binary. This will be solved by
  treating the scx_layered process specially.
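A conceptual sketch of how the unified path can look; the helper and field
names below are hypothetical, not the actual scx_layered symbols:

  #include <stdint.h>

  typedef uint32_t u32;

  #define MAX_LAYERS 16
  #define MAX_LLCS   64

  /* orders published by userspace for this CPU */
  struct cpu_ctx {
      u32 layer_order[MAX_LAYERS];
      u32 nr_layers;
      u32 llc_order[MAX_LLCS];
      u32 nr_llcs;
  };

  /* helper #1: i'th layer to consider from this CPU */
  static u32 layer_at(const struct cpu_ctx *cctx, u32 i)
  {
      return cctx->layer_order[i % cctx->nr_layers];
  }

  /* helper #2: j'th LLC to consider from this CPU */
  static u32 llc_at(const struct cpu_ctx *cctx, u32 j)
  {
      return cctx->llc_order[j % cctx->nr_llcs];
  }

With the orders coming from userspace, the topo and no-topo cases collapse
into one path: a no-topo system simply publishes a single LLC.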
Now that layers are allocated CPUs according to their weights under system
saturation and in-layer execution can be protected against preemption, layer
weights can be enforced solely through CPU allocation. While there are still
a couple of missing pieces - the dispatch ordering issue and fallback DSQ
protection - the framework is in place. Drop the now superfluous cost-based
fairness mechanism.
The previous commit made empty layers executing on the fallback CPU count as
owned execution instead of open so that they can be protected from
preemption. This broke the target number of CPUs calculation for empty
layers, which was only looking at open execution time. Update
calc_target_nr_cpus() so that it considers both owned and open execution
time for empty layers.
Currently, a preempting layer can completely starve out non-preempting ones
regardless of weight or other configuration. Implement preemption protection
where each CPU tries to protect in-layer execution beyond the high util
range, and up to full utilization under saturation. The fallback CPU is also
protected so that empty layers can run up to 50%, guaranteeing that empty
layers can easily start growing.
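A sketch of the gating decision; the names and the comparison below are
illustrative, only the general rule (owned execution is protected until it
reaches its target, with the fallback CPU protecting empty layers up to 50%)
comes from this change:

  #include <stdbool.h>
  #include <stdint.h>

  typedef uint32_t u32;

  struct cpu_prot {
      u32 owned_usage;          /* recent owned execution on this CPU */
      u32 owned_usage_target;   /* protection target, e.g. 50% on fallback */
  };

  /* a preempting layer may take this CPU only after the owned execution
   * target has been met */
  static bool may_preempt(const struct cpu_prot *p)
  {
      return p->owned_usage >= p->owned_usage_target;
  }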
Commit d971196 ("scx_utils: Rename hw_id and add sequential llc id")
makes llc id unique across NUMA nodes, so rely on this value to build
the LLC scheduling domain.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
The scheduler extends a lock holder's time slice at ops.dispatch() to avoid
preempting the lock holder and thereby slowing down system-wide progress.
However, this opens up the possibility that the slice extension is abused by
a lock holder. To mitigate the problem, check whether a task's time slice
was extended (lock_holder_xted) while the task is no longer a lock holder.
That means the task's time slice was extended but the lock has since been
released. In this case, give up the rest of the task's extended time slice.
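A sketch of the check; lock_holder_xted comes from the change, while the
other fields and the way the remaining slice is given up are simplified
assumptions:

  #include <stdbool.h>
  #include <stdint.h>

  typedef uint64_t u64;

  #define SLICE_MIN_NS 500000ULL   /* illustrative floor, not the real value */

  struct task_info {
      bool is_lock_holder;      /* currently holds a lock */
      bool lock_holder_xted;    /* slice was extended as a lock holder */
      u64  slice_ns;            /* remaining time slice */
  };

  /* extension was granted but the lock has been released since: give up
   * the rest of the extended slice so it can't be abused */
  static void check_xted_slice(struct task_info *t)
  {
      if (t->lock_holder_xted && !t->is_lock_holder) {
          t->slice_ns = SLICE_MIN_NS;
          t->lock_holder_xted = false;
      }
  }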
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Make llc_id a monotonically increasing unique value and rename hw_id to
kernel_id for topology structs.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
On systems with multiple NUMA nodes, core_ids can be reused. Create a
hw_id that is monotonically increasing and can be used to uniquely
identify CPU cores.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>