Previously, contextual information (such as sync wakeup and kernel
task) was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, make everything proportional to run
time and to the waker and wakee frequencies by scaling the run time
and the frequencies up or down.
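A minimal sketch of the scaling idea, not the actual scheduler code.
The `WakeCtx` type, the 3/2 boost ratio, and the way the three inputs
are combined in `lat_cri()` are all assumptions made for illustration.

```rust
/// Hypothetical context hints attached to a wakeup.
#[derive(Clone, Copy, Default)]
pub struct WakeCtx {
    pub sync_wakeup: bool,
    pub kernel_task: bool,
}

// Assumed boost ratio of 3/2 per hint; the real factors are not stated.
const BOOST_NUM: u64 = 3;
const BOOST_DEN: u64 = 2;

/// Number of context hints that are set.
fn nr_hints(ctx: WakeCtx) -> u32 {
    ctx.sync_wakeup as u32 + ctx.kernel_task as u32
}

/// Scale a frequency up once per hint that is set.
fn scale_up(v: u64, ctx: WakeCtx) -> u64 {
    (0..nr_hints(ctx)).fold(v, |acc, _| acc.saturating_mul(BOOST_NUM) / BOOST_DEN)
}

/// Scale a run time down once per hint that is set.
fn scale_down(v: u64, ctx: WakeCtx) -> u64 {
    (0..nr_hints(ctx)).fold(v, |acc, _| acc.saturating_mul(BOOST_DEN) / BOOST_NUM)
}

/// Illustrative latency criticality: higher frequencies and shorter run
/// times yield a larger value. No constant is added for the hints; they
/// only scale the inputs.
pub fn lat_cri(run_time_ns: u64, waker_freq: u64, wakee_freq: u64, ctx: WakeCtx) -> u64 {
    let freq = scale_up(waker_freq, ctx).saturating_mul(scale_up(wakee_freq, ctx));
    let run_time = scale_down(run_time_ns, ctx).max(1);
    freq.saturating_mul(1000) / run_time
}
```

With this shape, a hint changes the result in proportion to the task's
own run time and frequencies instead of shifting every task by the same
fixed amount.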
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, preemption was allowed only when a task was early in
its time slice, as enforced by LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is no longer necessary because
the lock holder preemption handling avoids harmful preemptions. So
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash preemption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When calculating a task's latency criticality, incorporate the task's
weight into runtime, wake_freq, and wait_freq more systematically.
It looks nicer and works better under heavy load.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks in its local DSQ to the global enqueue.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
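A rough sketch of the plumbing, under stated assumptions: the
`Confined` variant, the numeric values, the `BpfLayer` mirror struct,
and the `fill_layer()` helper are hypothetical, and `open` is assumed
to be set for both Open and Grouped layers, which is what makes the two
indistinguishable without `kind`.

```rust
/// Rust-side layer kind. The numeric values are assumed and would have
/// to match the enum on the BPF side.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LayerKind {
    Confined = 0,
    Grouped = 1,
    Open = 2,
}

/// Hypothetical mirror of the BPF `layer` struct, reduced to the two
/// fields discussed here.
#[repr(C)]
#[derive(Default)]
pub struct BpfLayer {
    pub open: bool,
    pub kind: u32,
}

/// Fill the BPF-visible fields from the Rust-side kind.
pub fn fill_layer(layer: &mut BpfLayer, kind: LayerKind) {
    // Assumed legacy behaviour: Open and Grouped both set `open`.
    layer.open = kind != LayerKind::Confined;
    // The new field carries the exact kind across the boundary.
    layer.kind = kind as u32;
}
```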
When a task holds a lock, refill its time slice once in the
ops.dispatch() path to avoid the lock holder preemption problem.
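The "refill once" rule can be sketched as follows. This is an
illustration only: `TaskCtx`, its fields, `SLICE_NS`, and
`try_refill_lock_holder()` are hypothetical names, and where the
`slice_refilled` flag gets reset is left out.

```rust
/// Hypothetical per-task state; field names are illustrative.
pub struct TaskCtx {
    pub slice_left_ns: u64,
    pub holds_lock: bool,
    pub slice_refilled: bool,
}

/// Assumed default time slice (5 ms), for illustration only.
pub const SLICE_NS: u64 = 5_000_000;

/// Called from the dispatch path when the running task's slice is used
/// up. Returns true if the task was granted one extra slice and should
/// keep running instead of being preempted while it holds a lock.
pub fn try_refill_lock_holder(t: &mut TaskCtx) -> bool {
    if t.holds_lock && !t.slice_refilled {
        t.slice_left_ns = SLICE_NS;
        // Remember the refill so a long lock holder cannot monopolise
        // the CPU by being refilled again and again.
        t.slice_refilled = true;
        return true;
    }
    false
}
```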
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Penalizing long-running tasks more heavily helps to segregate
latency-critical tasks, which are usually short-running, from
long-running tasks, which are compute-intensive.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rework per-arch vmlinux solution
* have a per-arch directory under sched/include/arch/, in which we
maintain a vmlinux.h symlink and the real file
vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
folder is removed.
* update meson build `-I` option to find the new vmlinux.h position
* update cargo build scripts to use the per-arch vmlinux.h for
generating bindings
* keep the original ClangInfo refactoring changes
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average number
of voluntary context switches.
This ensures that latency-sensitive tasks naturally receive an
additional priority boost, and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler fairer.
It also makes the scheduler more robust: rustland can now survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.
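One way to picture the budget rule, as a sketch rather than the actual
rustland code: a task waking from idle may lag behind the global
minimum vruntime by at most `slice_ns * lat_weight`. The function name
and the exact formula are assumptions.

```rust
/// Clamp the vruntime of a task that has been idle.
///
/// The budget (how far behind `min_vruntime` the task may stay) grows
/// with its latency weight, so tasks that voluntarily yield the CPU
/// often keep more of the credit they accumulated while sleeping.
pub fn clamp_vruntime(task_vruntime: u64, min_vruntime: u64, slice_ns: u64, lat_weight: u64) -> u64 {
    let budget = slice_ns.saturating_mul(lat_weight);
    let floor = min_vruntime.saturating_sub(budget);
    task_vruntime.max(floor)
}
```

For example, with `min_vruntime = 1_000_000` and `slice_ns = 1_000`, a
long-idle task with `lat_weight = 10` resumes at 990_000, while one with
`lat_weight = 100` resumes at 900_000 and is therefore scheduled
earlier.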
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The algorithm has evolved to decide the time slice without
tracking the system-wide load. So remove the obsolete load tracking
code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
reset_lock_futex_boost() should be called on every context switch of
a task. Otherwise, in the worst case, a task and its CPU could block
preemption. To avoid such a situation, add the missing
reset_lock_futex_boost() calls.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Building CpuPool from cache-cpu topology did not work on arm, because
the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is unavailable.
Read CPU topology instead.
Signed-off-by: Ming Yang <minos.future@gmail.com>
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rework lowlatency mode as follows:
- introduce task dynamic priority: task weight multiplied by the
average number of voluntary context switches
- use dynamic priority to determine the task's vruntime (instead of
the static task weight)
- evaluate the task's minimum vruntime as a function of the dynamic
priority (tasks with a higher dynamic priority can have a smaller
vruntime than tasks with a lower dynamic priority)
The dynamic priority makes it possible to maintain good system
responsiveness even without classifying tasks as "interactive" or
"regular"; therefore, in lowlatency mode only the shared DSQ is
used (the priority DSQ is disabled).
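A minimal sketch of the dynamic priority described above. The clamp
bounds, the base weight of 100, and both function names are assumptions
made for illustration, not the scx_bpfland implementation.

```rust
/// Assumed upper bound on the nvcsw multiplier (not from the source).
pub const MAX_AVG_NVCSW: u64 = 128;

/// Assumed weight of a default-priority task.
pub const BASE_WEIGHT: u64 = 100;

/// Dynamic priority: static weight multiplied by the average number of
/// voluntary context switches, with the multiplier clamped to at least 1
/// so that CPU-bound tasks keep their static weight.
pub fn dyn_prio(weight: u64, avg_nvcsw: u64) -> u64 {
    weight.saturating_mul(avg_nvcsw.clamp(1, MAX_AVG_NVCSW))
}

/// Charge used CPU time to vruntime, inversely scaled by the dynamic
/// priority: a higher priority means vruntime advances more slowly.
pub fn charge_vruntime(used_ns: u64, prio: u64) -> u64 {
    used_ns.saturating_mul(BASE_WEIGHT) / prio.max(1)
}
```

Under these assumptions, a task with weight 100 and an average of 10
voluntary context switches is charged a tenth of the vruntime of a
CPU-bound task with the same weight for the same run time.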
Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.
On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).
Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Update the documentation by adding the new task statistics provided
by scx_rustland_core.
Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The recent changes to `disable_topology` making the arg an `Option<bool>`
instead of a `bool` caused an issue with it incorrectly attaching arguments.
Make the argument `require_equals` to fix this case.
This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.
Test plan:
Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
[possible values: true, false]
For more information, try '--help'.
```
After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
core in a different LLC.
- Randomise the core order within each LLC.
This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.
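The three rules can be sketched with a nested shuffle. This is an
illustration, not the scx_layered code: the `Vec<Vec<Vec<usize>>>`
topology layout (node, then LLC, then core ids), the `XorShift` helper,
and `random_topo_order()` are hypothetical, and a tiny deterministic
PRNG stands in for a real random source so the sketch needs no external
crate.

```rust
/// Minimal xorshift64 PRNG; the seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Fisher-Yates shuffle of a slice in place.
    pub fn shuffle<T>(&mut self, v: &mut [T]) {
        for i in (1..v.len()).rev() {
            let j = (self.next_u64() % (i as u64 + 1)) as usize;
            v.swap(i, j);
        }
    }
}

/// Build a core order from `topo[node][llc] = core ids`.
pub fn random_topo_order(mut topo: Vec<Vec<Vec<usize>>>, rng: &mut XorShift) -> Vec<usize> {
    let mut order = Vec::new();
    // Rule 1: randomise the NUMA nodes.
    rng.shuffle(topo.as_mut_slice());
    for node in topo.iter_mut() {
        // Rule 2: randomise the LLCs within the node.
        rng.shuffle(node.as_mut_slice());
        for llc in node.iter_mut() {
            // Rule 3: randomise the cores within the LLC.
            rng.shuffle(llc.as_mut_slice());
            // All cores of this LLC are listed before any other LLC.
            order.extend(llc.iter().copied());
        }
    }
    order
}
```

Because each level is flattened only after its own shuffle, cores from
one LLC stay contiguous and LLCs from one node stay contiguous,
whatever the random choices are.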
Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.
Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```
This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
Make scx_rlfifo even simpler and keep dispatching tasks even if the CPUs
are all busy.
This makes it possible to better stress test the scx_rustland_core
backend, by using both the per-CPU DSQs and the global shared DSQ.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
scx_rustland is now effectively a deadline-based scheduler and not a
pure vruntime-based scheduler.
Clarify this in the source code. No functional change.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the nvcsw metric from the scx_rustland_core backend, instead of
retrieving this metric in user-space via procfs.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
With user-space scheduling we don't usually dispatch a task immediately
after selecting an idle CPU, so there's not much benefit in trying to
optimize the WAKE_SYNC scenario (when a task is waking up another task
and releasing the CPU) when picking an idle CPU.
Therefore, get rid of the WAKE_SYNC logic in select_cpu() and rely on
the user-space logic (that has access to the WAKE_SYNC information) to
handle this particular case.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Do not kick a CPU from rs_select_cpu() (called by the user-space
scheduler), since we may not immediately dispatch the task.
Instead, always try to wake up the task's assigned CPU after dispatching
to a global DSQ, ensuring it can be consumed immediately.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Prevent CPUs from going idle when the user-space scheduler has some
pending activities to complete.
Keeping the CPU alive makes it possible to consume tasks from the
user-space scheduler more efficiently, preventing bubbles in the
scheduling pipeline.
To achieve this, trigger a CPU kick from ops.update_idle() and set a
flag in the CPU context to prevent it from going idle. Then keep kicking
the CPU from ops.dispatch() until the flag is cleared, which occurs when
no more tasks are pending or when the CPU exits idle as a task starts
running on it.
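The flag handling described above can be modelled as a small state
machine. This is a plain-Rust sketch of the logic only, not the BPF
code: `CpuCtx`, `on_update_idle()`, `on_dispatch()`, and the
`nr_pending` counter are hypothetical names, and each function returns
whether the CPU should be kicked.

```rust
/// Hypothetical per-CPU context holding the "do not go idle" flag.
#[derive(Default)]
pub struct CpuCtx {
    pub prevent_idle: bool,
}

/// Models ops.update_idle(). Returns true if the CPU should be kicked.
pub fn on_update_idle(ctx: &mut CpuCtx, entering_idle: bool, nr_pending: u64) -> bool {
    if !entering_idle {
        // The CPU exits idle because a task starts running on it.
        ctx.prevent_idle = false;
        return false;
    }
    if nr_pending > 0 {
        // The user-space scheduler still has work: keep the CPU alive.
        ctx.prevent_idle = true;
        return true;
    }
    false
}

/// Models ops.dispatch(). Returns true if the CPU should be kicked again.
pub fn on_dispatch(ctx: &mut CpuCtx, nr_pending: u64) -> bool {
    if nr_pending == 0 {
        // Nothing left to consume: allow the CPU to go idle.
        ctx.prevent_idle = false;
    }
    ctx.prevent_idle
}
```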
This fixes the performance regression introduced by the
put_prev_task_scx() behavior change in Linux 6.12 (see #788).
Link: https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>