Use an "_" variable to access the returned valued of "saturating_sub()"
to mute the compilation warnings.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor match_layer() to prevent the compiling error caused by
uninitialization of the variable "nr_match_ors" before usage.
Move the checking of "nr_match_ors" after it access the value within
"layer->nr_match_ors" to make sure it's initiailized successfully.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Dispatching kthreads via user-space can still lead to deadlocks in
certain cases (for example we can still trigger stalls by running the
fork stressor via stress-ng).
To prevent such stalls simply dispatch kthreads directly from BPF for
now to prevent failures.
In the future we may consider to provide an API to restrict the
selection of tasks directly dispatched (for example passing a mask PF_*
flags to "whitelist" the tasks that are allowed to bypass the user-space
scheduler).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Updating nr_queued in a non-atomic when a queued task is consumed can
lead to underflows. We don't really care about being 100% accurate here,
since nr_queued should be considered more of a statistic than an
accurate value.
Therefore, just accept the fact that nr_queued can be inaccurate and
handle potential underflows.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
If a task that is executing sched_setaffinity() is dispatched on a
per-CPU DSQ it may stall the DSQ completely, since the task won't be
able to be consumed from the corresponding CPU.
This can be easily triggered running the following stress test:
$ stress-ng --aggressive -c (nproc) -f (nproc)
From the stall trace we can see something like the following:
R stress-ng[2648662] -6880ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x5 dsq_vtime=0
cpus=ff
__set_cpus_allowed_ptr+0x1c8/0x260
__sched_setaffinity+0x105/0x1c0
sched_setaffinity+0x1ed/0x2d0
__x64_sys_sched_setaffinity+0xa5/0x100
do_syscall_64+0x82/0x190
entry_SYSCALL_64_after_hwframe+0x76/0x7e
This should probably be addressed in the core sched_ext, but for now
prevent this deadlock by tracking when a task is executing
sched_setaffinity() and automatically bounce those tasks to the shared
DSQ (that can be consumed from any CPU).
This should solve all the recent CI failures with the scx_rustland_core
schedulers.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.
These flags can be also passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.
Moreover, avoid to prioritize the user-space scheduler too much and
dispatch it only if there are no other tasks that needs to be dispatched
in ops.dispatch().
This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
On Python versions that perform validation of this line it fails because
of a square bracket mismatch. This is due to the single quotes being
parsed first. Fix by changing the outer string to double quotes.
On WAKE_SYNC attempt to migrate the wakee on the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are not other tasks queued
to the local DSQ of the waker's CPU.
This is the same logic used in scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using the turbo boosted CPUs as preferred scheduling seems to be
beneficial only a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).
In practive the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.
This can be improved in the future adding multiple user-configurable
scheduling domains.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Many kernel threads performs latency critical tasks (e.g., net, gpu). In
particular, AMD GPU driver runs the most part in the kernel space using
kworker. Hence, treat kernel threads as if a woken up task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Initialize the node cpumask, which was previously uninitialized causing
metric calculations to be wrong when attempting to lookup CPUs in the
node cpumask.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.
Test plan:
- CI lint check passes.
The domains are added to the aggregator when load is added (and
duty_cycle is not 0.0f64).
This commit makes sure that all domains are added to the aggregator even
when the calculated duty_cycle is 0.
Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
Pass in the layer spec when determining the layer core growth algo. This
should make it easier to implement layer growth algos that are spec
specific.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.
When the task's vruntime is evaluated this is considered as the task has
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.
To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).
This method provides a more precise calculation of vruntime and results
in a fairer task's deadline evaluation.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>