I think I see PRs being harder to write because all parts of a CI job
are cancelled when one fails.
I think I am also starting to see that we have enough largely disjoint
moving pieces that there will often be one that is failing stress
tests at any time.
Make CI run all stress tests always to address this.
Remove cast_mask() function distributed throughout different schedulers
and add it in common.bpf.h so every scheduler can reference it once they
need to.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
If a waker is more latency critical than a wakee, inherit a waker's
latency criticality for the wakee. This allows the wakee to consider the
context of who wakes me up. For now, we limit such inheritance to one
hop and one schedule.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Use the cast_mask helper to clean up some of the bpf cpumask conversion
code for preemption.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add topology aware preemption that begins in the local LLC and attempts
to preempt from cpus nearest in the topology.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Previously, we found a victim from the entire CPUs, which include remote
or non-compatible CPUs. Now we limit our search for victim finding
within a task's compute domain.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add core growth algos for Big/Little core support. The algos allow
layers to grow layers by preferring either big or little cores first.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add extra ordering macros for Core/CPU structs for ease of use with
Rust standard library features. This issue was hit when trying to sort
cores based on the CoreType. See this similar issue for details:
https://github.com/rust-lang/rust/issues/113550
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The usage of cast_mask() within bpfland_enqueue aims to cast the type of
"p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *".
However, the type of "p->cpus_ptr" is already "const cpumask_t *" aka
"const struct cpumask *", so no conversion is needed.
Passing a value of type "struct cpumask *" into "struct bpf_cpumask *"
also leads to compiling error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Use an "_" variable to access the returned valued of "saturating_sub()"
to mute the compilation warnings.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor match_layer() to prevent the compiling error caused by
uninitialization of the variable "nr_match_ors" before usage.
Move the checking of "nr_match_ors" after it access the value within
"layer->nr_match_ors" to make sure it's initiailized successfully.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Dispatching kthreads via user-space can still lead to deadlocks in
certain cases (for example we can still trigger stalls by running the
fork stressor via stress-ng).
To prevent such stalls simply dispatch kthreads directly from BPF for
now to prevent failures.
In the future we may consider to provide an API to restrict the
selection of tasks directly dispatched (for example passing a mask PF_*
flags to "whitelist" the tasks that are allowed to bypass the user-space
scheduler).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Updating nr_queued in a non-atomic when a queued task is consumed can
lead to underflows. We don't really care about being 100% accurate here,
since nr_queued should be considered more of a statistic than an
accurate value.
Therefore, just accept the fact that nr_queued can be inaccurate and
handle potential underflows.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
If a task that is executing sched_setaffinity() is dispatched on a
per-CPU DSQ it may stall the DSQ completely, since the task won't be
able to be consumed from the corresponding CPU.
This can be easily triggered running the following stress test:
$ stress-ng --aggressive -c (nproc) -f (nproc)
From the stall trace we can see something like the following:
R stress-ng[2648662] -6880ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x5 dsq_vtime=0
cpus=ff
__set_cpus_allowed_ptr+0x1c8/0x260
__sched_setaffinity+0x105/0x1c0
sched_setaffinity+0x1ed/0x2d0
__x64_sys_sched_setaffinity+0xa5/0x100
do_syscall_64+0x82/0x190
entry_SYSCALL_64_after_hwframe+0x76/0x7e
This should probably be addressed in the core sched_ext, but for now
prevent this deadlock by tracking when a task is executing
sched_setaffinity() and automatically bounce those tasks to the shared
DSQ (that can be consumed from any CPU).
This should solve all the recent CI failures with the scx_rustland_core
schedulers.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.
These flags can be also passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.
Moreover, avoid to prioritize the user-space scheduler too much and
dispatch it only if there are no other tasks that needs to be dispatched
in ops.dispatch().
This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
On Python versions that perform validation of this line it fails because
of a square bracket mismatch. This is due to the single quotes being
parsed first. Fix by changing the outer string to double quotes.
On WAKE_SYNC attempt to migrate the wakee on the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are not other tasks queued
to the local DSQ of the waker's CPU.
This is the same logic used in scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using the turbo boosted CPUs as preferred scheduling seems to be
beneficial only a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).
In practive the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.
This can be improved in the future adding multiple user-configurable
scheduling domains.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Many kernel threads performs latency critical tasks (e.g., net, gpu). In
particular, AMD GPU driver runs the most part in the kernel space using
kworker. Hence, treat kernel threads as if a woken up task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>