scx_bpfland: enhanced task affinity

Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not overcommitted.

This is done by giving ops.enqueue(), in addition to ops.select_cpu(), a
second chance to find an idle CPU close to the previously used one.

Moreover, even when the task is dispatched to the global DSQs, check
whether there is an idle CPU in the primary domain that can immediately
consume it and, if so, wake it up.
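
Condensed, the new logic added to ops.enqueue() looks roughly like this
(a simplified excerpt of the diff below; the existing kthread fast path,
the priority/shared DSQ dispatch and the NULL check on primary_cpumask
are omitted):

  s32 cpu, prev_cpu = scx_bpf_task_cpu(p);

  /* Second chance: try to place the task on an idle CPU close to prev_cpu */
  cpu = pick_idle_cpu(p, prev_cpu);
  if (cpu >= 0 && !dispatch_direct_cpu(p, cpu, 0)) {
          __sync_fetch_and_add(&nr_direct_dispatches, 1);
          return;
  }

  /* ...the task is queued to the priority or shared DSQ here... */

  /* Wake up an idle CPU in the primary domain that can steal the queued task */
  if (bpf_cpumask_subset(cast_mask(primary), p->cpus_ptr)) {
          cpu = scx_bpf_pick_idle_cpu(cast_mask(primary), 0);
          if (cpu >= 0)
                  scx_bpf_kick_cpu(cpu, 0);
  }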

= Results =

This change provides a minor but consistent performance boost with the
CPU-intensive benchmarks from the CachyOS benchmark selection [1].

Similar results can also be observed with some WebGL benchmarks [2],
when the system is running close to its maximum capacity.

Test:
 - cachyos-benchmarker

System:
 - AMD Ryzen 7 5800X 8-Core Processor

Metrics:
 - total time: elapsed time of all benchmarks
 - total score: geometric mean of all benchmarks

NOTE: total time is the most relevant metric, since it measures
aggregate performance, while the total score emphasizes performance
consistency across all benchmarks.
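
For example, with three benchmarks scoring 100, 110 and 121, the total
score is (100 * 110 * 121)^(1/3) = 110: unlike the plain sum, the
geometric mean weights every benchmark equally in relative terms, so a
large regression in a single (even short) benchmark cannot be masked by
gains elsewhere.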

== Results: summary ==

 +-------------------------+---------------------+---------------------+
 |         Scheduler       |    Total Time       |    Total Score      |
 |                         |    (less = better)  |    (less = better)  |
 +-------------------------+---------------------+---------------------+
 |                 EEVDF   |  624.44 sec         |      123.68         |
 |               bpfland   |  625.34 sec         |      122.21         |
 | bpfland-task-affinity   |  623.67 sec         |      122.27         |
 +-------------------------+---------------------+---------------------+

== Conclusion ==

With this patch applied, bpfland shows better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and appear consistently across
multiple runs.

[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html

Tested-by: Piotr Gorski <piotr.gorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>

@@ -509,7 +509,7 @@ static int dispatch_direct_cpu(struct task_struct *p, s32 cpu, u64 enq_flags)
  * to handle these mistakes in favor of a more efficient response and a reduced
  * scheduling overhead.
  */
-static s32 pick_idle_cpu(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+static s32 pick_idle_cpu(struct task_struct *p, s32 prev_cpu)
 {
 	const struct cpumask *online_cpumask, *idle_smtmask, *idle_cpumask;
 	struct bpf_cpumask *primary, *turbo, *l2_domain, *l3_domain;
@@ -718,15 +718,6 @@ retry:
 		goto retry;
 	}
 
-	/*
-	 * If all the previous attempts have failed, try to use any idle CPU in
-	 * the system.
-	 */
-	cpu = bpf_cpumask_any_and_distribute(p->cpus_ptr, idle_cpumask);
-	if (bpf_cpumask_test_cpu(cpu, online_cpumask) &&
-	    scx_bpf_test_and_clear_cpu_idle(cpu))
-		goto out_put_cpumask;
-
 	/*
 	 * We couldn't find any idle CPU, so simply dispatch the task to the
 	 * first CPU that will become available.
@@ -753,7 +744,7 @@ s32 BPF_STRUCT_OPS(bpfland_select_cpu, struct task_struct *p, s32 prev_cpu, u64
 {
 	s32 cpu;
 
-	cpu = pick_idle_cpu(p, prev_cpu, wake_flags);
+	cpu = pick_idle_cpu(p, prev_cpu);
 	if (cpu >= 0 && !dispatch_direct_cpu(p, cpu, 0)) {
 		__sync_fetch_and_add(&nr_direct_dispatches, 1);
 		return cpu;
@@ -794,7 +785,9 @@ static void handle_sync_wakeup(struct task_struct *p)
  */
 void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
 {
+	struct bpf_cpumask *primary;
 	u64 deadline = task_deadline(p);
+	s32 cpu, prev_cpu = scx_bpf_task_cpu(p);
 
 	/*
 	 * If the system is saturated and we couldn't dispatch directly in
@@ -809,13 +802,22 @@ void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
 	 * local_kthreads is enabled.
 	 */
 	if (local_kthreads && is_kthread(p) && p->nr_cpus_allowed == 1) {
-		s32 cpu = scx_bpf_task_cpu(p);
-		if (!dispatch_direct_cpu(p, cpu, enq_flags)) {
+		if (!dispatch_direct_cpu(p, prev_cpu, enq_flags)) {
 			__sync_fetch_and_add(&nr_direct_dispatches, 1);
 			return;
 		}
 	}
 
+	/*
+	 * Second chance to find an idle CPU and try to contain the task on the
+	 * local CPU / cache / domain.
+	 */
+	cpu = pick_idle_cpu(p, prev_cpu);
+	if (cpu >= 0 && !dispatch_direct_cpu(p, cpu, 0)) {
+		__sync_fetch_and_add(&nr_direct_dispatches, 1);
+		return;
+	}
+
 	/*
 	 * Dispatch interactive tasks to the priority DSQ and regular tasks to
 	 * the shared DSQ.
@@ -834,6 +836,20 @@ void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
 			  deadline, enq_flags);
 		__sync_fetch_and_add(&nr_shared_dispatches, 1);
 	}
+
+	/*
+	 * If there are idle CPUs in the primary domain that are usable by the
+	 * task, wake them up to see whether they'd be able to steal the just
+	 * queued task.
+	 */
+	primary = primary_cpumask;
+	if (!primary)
+		return;
+	if (bpf_cpumask_subset(cast_mask(primary), p->cpus_ptr)) {
+		cpu = scx_bpf_pick_idle_cpu(cast_mask(primary), 0);
+		if (cpu >= 0)
+			scx_bpf_kick_cpu(cpu, 0);
+	}
 }
 
 /*