Schedule all tasks using a single global DSQ. This gives better
control to prevent potential starvation conditions.
With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average number
of voluntary context switches.
Tasks are still classified as interactive / non-interactive based on
the number of voluntary context switches, but this classification now
only affects the cpufreq logic.
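A rough sketch of the idea (helper and field names are illustrative,
not the actual scx_bpfland code): the wait and wake-up frequencies are
tracked per task as moving averages and folded into the deadline, so
tasks that block and wake up often get an earlier deadline:

```c
/*
 * Illustrative sketch only: avg_wait_freq / avg_wake_freq are assumed
 * to be moving averages maintained elsewhere in the task context.
 */
static u64 task_deadline(struct task_struct *p, struct task_ctx *tctx)
{
	u64 freq = tctx->avg_wait_freq + tctx->avg_wake_freq;
	u64 lat_weight = freq ? freq : 1;

	/* Higher wait/wake frequency => smaller delta => earlier deadline */
	return p->scx.dsq_vtime + slice_max_ns * 100 / lat_weight;
}
```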
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Since tasks' average runtimes show a skewed distribution, directly
using the runtime in the deadline calculation causes several
performance regressions. Instead, let's use a constant factor and
further prioritize the frequency factors to deprioritize long-runtime
tasks.
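Roughly, the resulting deadline looks like the following sketch (the
constant and field names are illustrative, not the actual scx_lavd
code):

```c
/*
 * Illustrative sketch: replace the raw average runtime term with a
 * constant base cost and let the wait/wake frequency factors dominate.
 */
static u64 calc_deadline(struct task_ctx *taskc, u64 now)
{
	u64 freq_factor = taskc->wait_freq + taskc->wake_freq;

	if (!freq_factor)
		freq_factor = 1;

	/* DEADLINE_CONST replaces avg_runtime, whose distribution is skewed */
	return now + DEADLINE_CONST / freq_factor;
}
```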
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revert the change that sent a self-IPI at preemption when the victim
CPU is the current CPU. The cost of a self-IPI is prohibitively
expensive in some workloads (e.g., perf bench). Instead, reset the
task's time slice to zero.
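A minimal sketch of the new behavior (the helper is hypothetical; the
real scx_lavd code differs in details):

```c
/*
 * Sketch: when the victim is the current CPU, skip the costly self-IPI
 * and just expire the running task's slice; the sched_ext core will
 * pick a new task at the next scheduling point.
 */
static void preempt_cpu(s32 victim_cpu, struct task_struct *cur)
{
	if (victim_cpu == bpf_get_smp_processor_id()) {
		cur->scx.slice = 0;	/* zero slice => switch out ASAP */
		return;
	}

	/* A remote victim still needs an actual kick */
	scx_bpf_kick_cpu(victim_cpu, SCX_KICK_PREEMPT);
}
```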
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rather than always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow the migration but
attempt to bring tasks back to their original LLC domain whenever
possible.
To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.
In the future we will add a proper load balancing logic, but for now
this change seems to provide consistent performance improvement in
certain server workloads.
For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).
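A sketch of the idle-CPU selection with a per-task LLC domain (the LLC
cpumask parameter is a stand-in; the actual scx_bpfland code differs):

```c
/*
 * Sketch: prefer an idle CPU in the task's own LLC domain, assigned at
 * task creation or affinity change, and only then fall back to any
 * allowed CPU (a temporary cross-LLC migration).
 */
static s32 pick_idle_cpu(struct task_struct *p, const struct cpumask *llc_mask)
{
	s32 cpu;

	/* First choice: an idle CPU in the task's assigned LLC domain */
	cpu = scx_bpf_pick_idle_cpu(llc_mask, 0);
	if (cpu >= 0)
		return cpu;

	/* Otherwise allow a temporary cross-LLC migration */
	return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
}
```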
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This prevents excessive starvation of regular tasks in the presence of
a large number of interactive tasks (e.g., when running stress tests,
such as hackbench).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This can lead to stalls when a high number of interactive tasks are
running in the system (e.g., hackbench or similar stress tests).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.
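For reference, this is roughly what setting the flag looks like (a
sketch; the callback wiring is elided to keep it short):

```c
/*
 * Sketch: SCX_OPS_ENQ_EXITING tells the sched_ext core to also pass
 * exiting tasks through ops.enqueue() rather than bypassing the BPF
 * scheduler for them.
 */
SEC(".struct_ops.link")
struct sched_ext_ops bpfland_ops = {
	/* ... enqueue/dispatch/init/exit callbacks go here ... */
	.flags = SCX_OPS_ENQ_EXITING,
	.name  = "bpfland",
};
```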
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.
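The usual pattern, as seen in other scx schedulers (a minimal sketch):

```c
static u64 vtime_now;

/* True if vtime @a is before @b, handling wrap-around */
static inline bool vtime_before(u64 a, u64 b)
{
	return (s64)(a - b) < 0;
}

void BPF_STRUCT_OPS(bpfland_running, struct task_struct *p)
{
	/* Keep the global vruntime clock in sync with the running task */
	if (vtime_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}
```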
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Fix a task filtering logic error to avoid the possibility of migrating
the same task over again. The original logic used the "||" operator,
which might cause tasks that were already migrated to be considered
again. Change the condition to "&&" to eliminate the error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The function "try_find_move_task()" returns directly when no task is
found to be moved. If the cause is that no task can satisfy the
condition imposed by "task_filter()", the load balancer will try to
find a task to move again, replacing "task_filter()" with a function
that always returns true.
However, in that fallback case, the tasks within the domains will be
empty. Swapping the tasks back into the domains vector before
returning solves the issue.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The combination of kernel versions and kernel configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is currently no
workaround on the fentry/fexit/kprobe side. Let's entirely drop
the kernel lock tracking for now and revisit it later.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revise the lock tracking code to rely on symbols that are stable
across various kernel configurations. There are two changes:
- Entirely drop tracing of rt_mutex, which can be turned on and off by
  kconfig.
- Replace the mutex_lock() family with __mutex_lock(), which is stable
  across kernel configs. The downside of this change is that it is no
  longer possible to trace the lock fast path, so lock tracing is a bit
  less accurate. But let's live with it for now until a better solution
  is found.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fallback DSQs are not accounted with costs. If a layer is saturating
the machine, it is possible to never consume from the fallback DSQ and
stall its tasks. This introduces an additional consumption from the
fallback DSQ when a layer runs out of budget. In addition, tasks that
use partial CPU affinities should be placed into the fallback DSQ.
This change was tested with stress-ng --cacheline `nproc` for several
minutes without causing stalls (which would occur on main).
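A sketch of the consumption path (has_budget(), layer_dsq_id() and
FALLBACK_DSQ_ID are stand-ins, not the actual scx_layered names):

```c
/*
 * Sketch: even when a layer has exhausted its budget, keep draining
 * the fallback DSQ so tasks queued there cannot stall indefinitely
 * on a saturated machine.
 */
static bool try_consume(struct cost *costc, u32 layer_id)
{
	if (has_budget(costc, layer_id) &&
	    scx_bpf_consume(layer_dsq_id(layer_id)))
		return true;

	/* Out of budget: still consume from the fallback DSQ */
	return scx_bpf_consume(FALLBACK_DSQ_ID);
}
```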
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Flip the order of layer id vs layer name so that the output makes sense.
Example output:
LO_FALLBACK nr_queued=0 -0ms
COST GLOBAL[0][random] budget=22000000000 capacity=22000000000
COST GLOBAL[1][hodgesd] budget=0 capacity=0
COST GLOBAL[2][stress-ng] budget=0 capacity=0
COST GLOBAL[3][normal] budget=0 capacity=0
COST CPU[0][0][random] budget=62500000000000 capacity=62500000000000
COST CPU[0][1][random] budget=100000000000000 capacity=100000000000000
COST CPU[0][2][random] budget=124911500964411 capacity=125000000000000
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The dynamic nvcsw threshold is no longer used in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so
let's just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows tuning the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh), and it also restores the
maximum nvcsw limit in non-lowlatency mode, which was incorrectly
changed during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evaluate the number of voluntary context switches directly in the BPF
code, without relying on the kernel's p->nvcsw metric.
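A sketch of how this can be done in ops.stopping() (the task-context
lookup helper is hypothetical):

```c
/*
 * Sketch: in ops.stopping(), @runnable == false means the task is
 * releasing the CPU because it is blocking, i.e. a voluntary context
 * switch, so the counter can be maintained without p->nvcsw.
 */
void BPF_STRUCT_OPS(bpfland_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p); /* hypothetical helper */

	if (tctx && !runnable)
		tctx->nvcsw++;
}
```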
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add the layer CPU cost when dumping. This is useful for understanding
the per layer cost accounting when layered is stalled.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add the layer name to the BPF representation of a layer. When printing
debug output, print the layer name as well as the layer index.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
while it should be "struct task_ctx *".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor dispatch to use a separate set of global helpers for topo aware
dispatch. This change only refactors dispatch to make it more
maintainable, without any functional changes.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Pinning a task to a single CPU is a widely used optimization to
improve latency by reusing cache. So when a task is pinned to
a single CPU, let's boost its latency criticality.
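A minimal sketch of the boost (the constant and field names are
illustrative):

```c
/*
 * Sketch: a task pinned to a single CPU is likely optimized for cache
 * reuse, so raise its latency criticality.
 */
static void boost_pinned_task(struct task_struct *p, struct task_ctx *taskc)
{
	if (p->nr_cpus_allowed == 1)
		taskc->lat_cri += LAVD_LAT_CRI_PIN_BOOST; /* illustrative constant */
}
```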
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Calling reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to ops.running(). This way, we prevent the lock holder
preemption only when a lock is acquired between ops.running() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptibility test. This addresses the following
GitHub issue:
https://github.com/sched-ext/scx/issues/856
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add cost accounting for layers to make weights work on the BPF side.
This is done at both the CPU level as well as globally. When a CPU
runs out of budget, it acquires budget from the global context. If a
layer runs out of global budget, then all budgets are reset. Weight
handling is done by iterating over layers by their available budget.
Layer budgets are proportional to their weights.
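A sketch of the budget flow (lookup_global_cost(), reset_budgets() and
the cost layout are stand-ins, not the actual scx_layered helpers):

```c
/*
 * Sketch: charge the layer's per-CPU budget; on exhaustion, refill
 * from the global context, and when the global budget is also gone,
 * reset all budgets proportionally to the layer weights.
 */
static void charge_cost(struct cost *cpuc, u32 layer_id, s64 amount)
{
	cpuc->budget[layer_id] -= amount;
	if (cpuc->budget[layer_id] > 0)
		return;

	struct cost *global = lookup_global_cost();

	if (global && global->budget[layer_id] > 0) {
		/* Acquire a refill from the global context */
		s64 refill = global->capacity[layer_id] / nr_possible_cpus;

		global->budget[layer_id] -= refill;
		cpuc->budget[layer_id] += refill;
	} else {
		reset_budgets();	/* weighted reset across all layers */
	}
}
```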
When the current task decides to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this case because the sched_ext
core does not call resched_curr() on the ops.enqueue() path.
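A minimal sketch of the yield path under that constraint:

```c
/*
 * Sketch: expire the slice *and* kick the CPU with SCX_KICK_PREEMPT,
 * since a zero slice alone does not trigger resched_curr() from the
 * ops.enqueue() path.
 */
static void yield_current(struct task_struct *cur, s32 cpu)
{
	cur->scx.slice = 0;
	scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
```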
Signed-off-by: Changwoo Min <changwoo@igalia.com>
An eligible task is unlikely to be preemptible. In other words, an
ineligible task is more likely to be preemptible because of the greedy
ratio penalty in its virtual deadline calculation. Hence, we skip the
preemptibility test for an eligible task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Some of the new timer code doesn't verify on older kernels like 6.9. Modify the
code a little to get it verifying again.
Also apply some small fixes to the logic: error handling was a little
off before, and we were using the wrong key in lookups.
Test plan:
- CI
The previous code accesses uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <- try_yield_current_cpu()
to test whether task2 is a lock holder. However, task2 is guaranteed
not to be a lock holder in all its callers, so move the lock holder
test to can_cpu1_kick_cpu2().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This reduces the temporary stall time of the task by waking
up the CPU as early as possible.
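A sketch of the kick (domain cpumask handling is simplified):

```c
/*
 * Sketch: after enqueueing, wake an idle CPU in the chosen scheduling
 * domain. SCX_KICK_IDLE only kicks the CPU if it is actually idle, so
 * this is cheap when the domain is busy.
 */
static void kick_idle_cpu(const struct cpumask *domain_mask)
{
	s32 cpu = scx_bpf_pick_idle_cpu(domain_mask, 0);

	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}
```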
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used to penalize latency linearly in the greedy ratio. However,
this gives the greedy ratio too much influence in determining the
virtual deadline, especially among under-utilized tasks (< 100.0%).
Now, we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we apply a somewhat milder
penalty to avoid sudden latency spikes.
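Numerically, the new penalty curve looks roughly like this sketch (the
ratio scale and divisor are illustrative, not the actual scx_lavd
constants):

```c
/*
 * Sketch: greedy ratio in 1/1000 units (1000 == 100.0%). Everything
 * under-utilized collapses to the same value; over-utilization is
 * penalized sub-linearly to avoid sudden latency spikes.
 */
static u64 greedy_penalty(u64 ratio)
{
	if (ratio <= 1000)
		return 1000;			/* treat all < 100.0% alike */

	return 1000 + (ratio - 1000) / 2;	/* milder than linear */
}
```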
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, contextual information—such as sync wakeup and kernel
task—was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and waker and wakee frequencies by scaling up/down the run
time and the frequencies.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, preemption was allowed only when a task was early in its
time slice, enforced through LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is not necessary anymore because
the lock holder preemption logic can avoid harmful preemptions. So we
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash the preemption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When calculating a task's latency criticality, incorporate the task's
weight into the runtime, wake_freq, and wait_freq more systematically.
It looks nicer and works better under heavy load.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks in its local DSQ back to the global enqueue path.
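This maps naturally onto ops.cpu_release() with
scx_bpf_reenqueue_local(); a minimal sketch:

```c
/*
 * Sketch: when a higher-priority sched class (e.g. RT) takes this CPU,
 * re-enqueue everything sitting in the local DSQ so those tasks go
 * back through ops.enqueue() and can run on other CPUs.
 */
void BPF_STRUCT_OPS(lavd_cpu_release, s32 cpu,
		    struct scx_cpu_release_args *args)
{
	scx_bpf_reenqueue_local();
}
```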
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.
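A sketch of the one-shot refill (field and constant names are
illustrative):

```c
/*
 * Sketch: in ops.dispatch(), give a lock-holding task one slice
 * extension so it can release the lock before being switched out.
 */
static void try_extend_slice(struct task_struct *p, struct task_ctx *taskc)
{
	if (is_lock_holder(taskc) && !taskc->slice_extended) {
		taskc->slice_extended = true;	/* refill only once */
		p->scx.slice = slice_max_ns;
	}
}
```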
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Giving a larger penalty to long-running tasks helps to segregate
latency-critical tasks, which are usually short-running, from
long-running tasks, which are compute-intensive.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rework per-arch vmlinux solution
* have a per-arch directory under sched/include/arch/, in which we
  maintain the vmlinux.h symlink and the real file
  vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
  folder is removed.
* update the meson build `-I` option to find the new vmlinux.h location
* update cargo build scripts to use the per-arch vmlinux.h for
generating bindings
* keep the original ClangInfo refactoring changes
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average
number of voluntary context switches.
This ensures that latency-sensitive tasks naturally receive an
additional priority boost and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler more fair.
It also makes the scheduler more robust, now rustland can survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.
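A sketch of the budget clamp (written in C for illustration; names and
placement are not the actual scx_rustland code):

```c
static u64 vtime_now;	/* global vruntime clock */

/*
 * Sketch: the maximum vruntime budget an idle task can accumulate
 * scales with its latency weight, so latency-sensitive tasks wake up
 * with an earlier effective deadline without scaling vruntime down.
 */
static void refresh_task_vtime(struct task_struct *p, u64 lat_weight)
{
	u64 budget = slice_ns * lat_weight;

	/* Don't let a task fall more than @budget behind the clock */
	if ((s64)(p->scx.dsq_vtime - (vtime_now - budget)) < 0)
		p->scx.dsq_vtime = vtime_now - budget;
}
```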
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The algorithm has evolved to decide the time slice without
tracking the system-wide load, so remove the obsolete load tracking
code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
reset_lock_futex_boost() should be called at every context switch of a
task. Otherwise, in the worst case, a task (and its CPU) could block
preemption. To avoid such a situation, add the missing
reset_lock_futex_boost() calls.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Building the CpuPool from the cache-CPU topology did not work on arm,
because the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is
unavailable there. Read the CPU topology instead.
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rework lowlatency mode as follows:
- introduce a task dynamic priority: the task's weight multiplied by
  the average number of voluntary context switches
- use the dynamic priority to determine the task's vruntime (instead
  of the static task weight)
- evaluate the task's minimum vruntime as a function of the dynamic
  priority (tasks with a higher dynamic priority can have a smaller
  vruntime compared to tasks with a lower dynamic priority); see the
  sketch below
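A sketch of the dynamic priority and its effect on vruntime
accumulation (field names are illustrative, not the actual
scx_bpfland code):

```c
/*
 * Sketch: dynamic priority = static weight * average voluntary
 * context switch rate. A higher dynamic priority makes vruntime
 * accumulate more slowly, i.e. the task is favored.
 */
static u64 task_dyn_prio(struct task_struct *p, struct task_ctx *tctx)
{
	u64 nvcsw = tctx->avg_nvcsw ? tctx->avg_nvcsw : 1;

	return p->scx.weight * nvcsw;
}

static u64 task_vtime_delta(struct task_struct *p, struct task_ctx *tctx,
			    u64 runtime)
{
	return runtime * 100 / task_dyn_prio(p, tctx);
}
```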
The dynamic priority makes it possible to maintain good system
responsiveness even without classifying tasks as "interactive" or
"regular"; therefore, in lowlatency mode only the shared DSQ is used
(the priority DSQ is disabled).
Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.
On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).
Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Update the documentation adding the new task statistics provided by
scx_rustland_core.
Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The recent change to `disable_topology`, making the arg an `Option<bool>`
instead of a `bool`, caused an issue where it incorrectly attached
arguments. Make the argument `require_equals` to fix this case.
This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.
Test plan:
Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
[possible values: true, false]
For more information, try '--help'.
```
After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
core in a different LLC.
- Randomise the core order within each LLC.
This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.
Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.
Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```
This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
Make scx_rlfifo even simpler and keep dispatching tasks even when all
CPUs are busy.
This allows better stress testing of the scx_rustland_core backend,
using both the per-CPU DSQs and the global shared DSQ.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>