Commit Graph

1051 Commits

Andrea Righi
a7965abdbc scx_utils: clarify error about missing CONFIG_DEBUG_INFO_BTF
If CONFIG_DEBUG_INFO_BTF is not enabled in the kernel, the C schedulers
report the following error via libbpf, clearly indicating the missing
kernel config:

 libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?

In contrast, the Rust schedulers report a less clear error:

 thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
 btf__load_vmlinux_btf() returned NULL
 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Make sure to report a similar error, so that users get a clearer hint
about the missing kernel config. After this change the error looks like
the following:

 thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
 btf__load_vmlinux_btf() returned NULL, was CONFIG_DEBUG_INFO_BTF enabled?

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 09:15:43 +02:00
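
A minimal sketch of the clarified check, assuming the libbpf-sys crate's
raw binding for btf__load_vmlinux_btf(); this is not the actual
compat.rs code, just an illustration of the message change:

```rust
// Panic with the clearer message when vmlinux BTF cannot be loaded.
fn load_vmlinux_btf() -> *mut libbpf_sys::btf {
    // btf__load_vmlinux_btf() parses /sys/kernel/btf/vmlinux, which only
    // exists when the kernel was built with CONFIG_DEBUG_INFO_BTF=y.
    let btf = unsafe { libbpf_sys::btf__load_vmlinux_btf() };
    if btf.is_null() {
        panic!("btf__load_vmlinux_btf() returned NULL, was CONFIG_DEBUG_INFO_BTF enabled?");
    }
    btf
}
```
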
David Vernet
8537c1b474
Merge pull request #393 from sched-ext/revert-382-rusty_refactor
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load"
2024-06-26 16:36:11 -05:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
Changwoo Min
41d60aef04
Merge pull request #391 from multics69/lavd-tuning-v4
scx_lavd: tweaks to avoid fork starvation
2024-06-27 00:21:48 +09:00
Andrea Righi
d26a76f238
Merge pull request #390 from sirlucjan/scx-update2
Revert "Add After=graphical.target into service"
2024-06-26 12:11:49 +02:00
Piotr Gorski
1659152a62
Revert "Add After=graphical.target into service"
This reverts commit f7e575808b.

Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-26 12:08:21 +02:00
Changwoo Min
abc6e31fef scx_lavd: for a forked task, inherit its parent's statistics
The old approach was too conservative about running a new task, so when
a fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting the parent's statistics, which seems to be a good (at least
better than the old) guess at how a new task will behave.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 19:00:10 +09:00
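
A rough sketch of the idea only; the real code lives in the lavd BPF
program and these field names are hypothetical:

```rust
#[derive(Clone, Copy, Default)]
struct TaskStats {
    avg_runtime: u64,
    wait_freq: u64,
    wake_freq: u64,
}

// A forked task starts from its parent's statistics rather than from
// zero: the parent's history is a better first guess of how the child
// will behave.
fn stats_for_forked_task(parent: &TaskStats) -> TaskStats {
    *parent
}
```
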
Changwoo Min
ac9c49f5b5 scx_lavd: loosen the deadline when overloaded
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latency-intensive tasks, so loosen the deadline when the
system is overloaded (> 100% utilization).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 15:06:31 +09:00
Changwoo Min
b32734168b scx_lavd: print build ID when lavd is loaded
When lavd is loaded, it prints out its build ID, which makes it easy to
identify which version is being tested.

```
01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu)
```

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 10:57:19 +09:00
Dan Schatzberg
d349f86d04 mitosis: Update synchronization
The synchronization for mitosis is a bit ad hoc, working around the lack
of atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE
and compiler barriers to get the behaviors we want.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-25 12:44:16 -07:00
David Vernet
98a5fa8595
Merge pull request #371 from sched-ext/build_id
Add build-id to build process
2024-06-25 11:45:32 -05:00
David Vernet
d42bae4fcf
rusty: Print build ID when rusty is loaded
When someone is testing schedulers, we often have to ask which version
of the scheduler they are running. Now that we can access the build ID
from rust schedulers, let's update scx_rusty to print the build ID when
rusty first starts running.

This results in output such as the following:

```
[void@maniforge scx]$ rusty
19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug)
19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111
19:04:26 [INFO]   DOM[00] mask= 0b00000000111111110000000011111111
19:04:26 [INFO]   DOM[01] mask= 0b11111111000000001111111100000000
19:04:26 [INFO] Rusty scheduler started!
```

Signed-off-by: David Vernet <void@manifault.com>
2024-06-25 11:44:46 -05:00
David Vernet
2aa8bbc32d
utils: Export build ID values from rust scx_utils
We want schedulers to be able to print, log, etc. the build ID of the
repository. To do this, we can use the vergen Cargo crate to generate
environment variables that contain values we can export from scx_utils.

This patch updates scx_utils accordingly, using vergen to generate build
ID output that can be printed from schedulers. A subsequent patch will
update scx_rusty to print this build ID value.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-25 11:44:19 -05:00
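
A hedged sketch of the approach, assuming build.rs runs vergen so the
VERGEN_* environment variables exist at compile time (the VERGEN_* names
follow vergen's conventions; full_version() itself is a hypothetical
helper):

```rust
// Build a version string like the one printed by the schedulers, e.g.
// "0.8.1-g<sha> x86_64-unknown-linux-gnu".
pub fn full_version() -> String {
    format!(
        "{}-g{} {}",
        env!("CARGO_PKG_VERSION"),
        env!("VERGEN_GIT_SHA"),
        env!("VERGEN_CARGO_TARGET_TRIPLE"),
    )
}
```
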
David Vernet
9d9ece11aa
Merge pull request #384 from jfernandez/log-recorder
scx_utils: Add log_recorder module for metrics-rs
2024-06-25 11:43:37 -05:00
David Vernet
e60c5c024b
Merge pull request #387 from sirlucjan/scx-update
Add After=graphical.target into service
2024-06-25 11:10:38 -05:00
Andrea Righi
39240a27ce
Merge pull request #380 from sched-ext/rustland-core-smooth-perf
scx_rustland_core: smooth performance
2024-06-25 14:52:58 +02:00
Andrea Righi
5db0908530 scx_rustland_core: make sure to use a valid CPU during direct dispatch
We may end up selecting an invalid CPU (according to the task's cpumask)
when dispatching the task via dispatch_direct_cpu().

When this happens, simply return an error, do not dispatch the task, and
let the caller handle the error: in the context of select_cpu() we can
simply ignore the dispatch and return the target CPU; in the context of
FIFO-mode dispatch we can fall back to SCX_DSQ_LOCAL if the target CPU
is not valid.

This fixes issue #353.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:11:46 +02:00
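
A sketch of the described fallback under a simplified, hypothetical API;
only dispatch_direct_cpu() and SCX_DSQ_LOCAL come from the commit
message, everything else is illustrative glue:

```rust
struct Task {
    cpumask: Vec<bool>, // cpumask[cpu] == true if the task can run there
}

// Refuse to dispatch to a CPU outside the task's cpumask.
fn dispatch_direct_cpu(task: &Task, cpu: usize) -> Result<(), ()> {
    if !task.cpumask.get(cpu).copied().unwrap_or(false) {
        return Err(());
    }
    Ok(()) // the real code would dispatch to the per-CPU DSQ here
}

// FIFO-mode dispatch: fall back to SCX_DSQ_LOCAL on an invalid CPU.
fn dispatch_fifo(task: &Task, cpu: usize) {
    if dispatch_direct_cpu(task, cpu).is_err() {
        dispatch_local(task);
    }
}

fn dispatch_local(_task: &Task) { /* dispatch to SCX_DSQ_LOCAL */ }
```
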
Andrea Righi
e4b13b2aa6 scx_rustland_core: reduce dispatch overhead
Kick CPUs in the dispatch path only when needed (typically when tasks
are bounced to other CPUs).

Moreover, avoid consuming all the dispatched tasks at once.

This seems to reduce the BPF overhead (according to bpftop), going from
~10% down to ~6% CPU usage of rustland_dispatch() on an over-commissioned
system, without introducing any measurable performance regression.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
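
A rough sketch of the two tweaks, with all names hypothetical:

```rust
// One entry per dispatched task: (previous CPU, target CPU).
fn dispatch_next(queue: &mut Vec<(i32, i32)>) {
    // Consume a single dispatched task instead of draining the queue.
    if let Some((prev_cpu, cpu)) = queue.pop() {
        if cpu != prev_cpu {
            // Kick only when the task was bounced to another CPU.
            kick_cpu(cpu);
        }
    }
}

fn kick_cpu(_cpu: i32) { /* wake up the target CPU */ }
```
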
Andrea Righi
631d5576dc scx_rustland_core: refactor CPU selection logic
Allow dispatching tasks directly (bypassing the user-space scheduler)
only when the scheduler is operating in FIFO mode.

On an over-commissioned system, directly dispatching tasks can only
increase OS noise. These tasks can get a brief priority boost and an
extended time slice just because they found an idle CPU, which can lead
to erratic behavior.

This is particularly problematic when measuring performance stability,
such as evaluating the frames-per-second (fps) of a video game on an
overloaded system.

In such cases, it's better to bounce all tasks to the user-space
scheduler, which will ensure a better level of fairness and smoother
performance.

Moreover, get rid of the second-chance dispatch logic introduced in
commit 4791d862 ("scx_rustland_core: second chance CPU migration"). It
seems to provide benefits only on certain architectures (Intel) but can
introduce lags on others (AMD).

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
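
A sketch of the new policy, assuming a hypothetical fifo_mode flag and
simplified types:

```rust
// Pick a dispatch target: None means "bounce to the user-space scheduler".
fn select_target(fifo_mode: bool, idle_cpu: Option<usize>) -> Option<usize> {
    if !fifo_mode {
        // Outside FIFO mode, always bounce to the user-space scheduler,
        // which ensures fairness and smoother performance on an
        // over-commissioned system.
        return None;
    }
    // FIFO mode: direct dispatch to an idle CPU is still allowed.
    idle_cpu
}
```
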
Andrea Righi
19217b5722 scx_rustland_core: clarify comment about resuming FIFO mode
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Andrea Righi
081d4bdb86 scx_rustland_core: add debugging to dispatch_direct_cpu()
Report dispatch_direct_cpu() events in the trace, like any other
dispatch-related event.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Piotr Gorski
f7e575808b
Add After=graphical.target into service
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-25 11:05:37 +02:00
Changwoo Min
11f685f7ec
Merge pull request #386 from multics69/lavd-tuning-v3
scx_lavd: revising tunables to reduce micro-stutters
2024-06-25 17:09:30 +09:00
Changwoo Min
5d0db5c5fe scx_lavd: revising tunables to reduce micro-stutters
This is a second attempt to optimize tunables for a wider range of
games.

1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of the nice
   range). Now the latency priority (biased by the nice value) decides
   which task should run first; the nice value decides the time slice.

2) The first change gives latency-critical tasks higher priority than
   before. To compensate, the slice boost was also increased (2x -> 3x).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-25 16:13:32 +09:00
Jose Fernandez
e5984ed016
scx_utils: Add log_recorder module for metrics-rs
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder logs all metrics to the
console at a configurable interval in an easy-to-read format. Each
metric type is displayed in a separate section, and indentation shows
the hierarchy of the metrics. This results in more verbose output, but
it is easier to read and understand.

scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.

Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.

Counters:
  dispatched_tasks_total: 65559 [1344.8/s]
    prev_idle: 44963 (68.6%) [966.5/s]
    wsync_prev_idle: 15696 (23.9%) [317.3/s]
    direct_dispatch: 2833 (4.3%) [35.3/s]
    dsq: 1804 (2.8%) [21.3/s]
    wsync: 262 (0.4%) [4.3/s]
    direct_greedy: 1 (0.0%) [0.0/s]
    pinned: 0 (0.0%) [0.0/s]
    greedy_idle: 0 (0.0%) [0.0/s]
    greedy_xnuma: 0 (0.0%) [0.0/s]
    direct_greedy_far: 0 (0.0%) [0.0/s]
    greedy_local: 0 (0.0%) [0.0/s]
  dl_clamped_total: 1290 [20.3/s]
  dl_preset_total: 514 [1.0/s]
  kick_greedy_total: 6 [0.3/s]
  lb_data_errors_total: 0 [0.0/s]
  load_balance_total: 0 [0.0/s]
  repatriate_total: 0 [0.0/s]
  task_errors_total: 0 [0.0/s]

Gauges will show the last set value:

Gauges:
  slice_length_us: 20000.00

Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.

Histograms:
  cpu_busy_pct: avg=1.66 min=1.16 max=2.16
  load_avg node=0: avg=0.31 min=0.23 max=0.39
  load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
  processing_duration_us: avg=297.50 min=296.00 max=299.00

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-24 18:45:02 -06:00
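
A hedged usage sketch of the recording side: schedulers write through
the metrics-rs facade and the log recorder prints the sections shown
above. The gauge!/histogram! macros are the handle-style metrics-rs API;
the recorder setup itself is omitted here:

```rust
use metrics::{gauge, histogram};

// Record one sample; the log recorder prints gauges as the last set
// value and histograms as avg/min/max per log interval.
fn record_sample(slice_us: f64, busy_pct: f64) {
    gauge!("slice_length_us").set(slice_us);
    histogram!("cpu_busy_pct").record(busy_pct);
}
```
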
Changwoo Min
71425af06d
Merge pull request #385 from multics69/link-blog2
README: add a link to Changwoo's blog post on scx (part 2)
2024-06-25 09:12:01 +09:00
Changwoo Min
2e35bca24c README: add a link to Changwoo's blog post on scx (part 2)
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-25 09:10:24 +09:00
David Vernet
8059acb634
Merge pull request #381 from vax-r/rusty_dom_load_status_check
scx_rusty: Pull domain status check
2024-06-24 17:54:54 -05:00
David Vernet
55ee210d42
Merge pull request #382 from vax-r/rusty_refactor
scx_rusty: Refactor ridx assignment in populate_tasks_by_load
2024-06-24 17:47:55 -05:00
Changwoo Min
45b2c3d9fe
Merge pull request #383 from multics69/lavd-param-tuning
scx_lavd: revising tunables for less-preemptive games
2024-06-24 09:02:20 +09:00
Changwoo Min
016229cbcf scx_lavd: revising tunables for less-preemptive games
In some games (e.g., Elden Ring), it was observed that preemption
happens much less frequently. The reason is that tasks' runtime per
schedule is similar, so it does not meet the existing criteria. To
alleviate the problem, the following three tunables are revised:

1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to
   trigger more preemption.

2) A smaller LAVD_SLICE_MAX_NS works better, especially on 250 or 300 Hz
   kernels.

3) A longer LAVD_ELIGIBLE_TIME_MAX perturbs timelines less frequently.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-24 00:27:33 +09:00
I Hsin Cheng
eab234a74f scx_rusty: Refactor ridx assignment in populate_tasks_by_load
The original assignment of the variable ridx is equivalent to choosing
the larger of "ridx" and "wids - MAX_PIDS". Use the u64 max() library
helper to perform the comparison and provide better readability.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:58:51 +08:00
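
A minimal sketch of the refactor, reusing the names from the commit
message (the MAX_PIDS value and the function context are made up for
illustration):

```rust
const MAX_PIDS: u64 = 1024; // hypothetical value for illustration

// Assumes the caller guarantees wids >= MAX_PIDS, as in the original
// context; otherwise the subtraction would underflow.
fn next_ridx(ridx: u64, wids: u64) -> u64 {
    // Equivalent to the original if/else assignment, at a glance.
    ridx.max(wids - MAX_PIDS)
}
```
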
I Hsin Cheng
84b9ac4dce scx_rusty: Pull domain status check
Check whether the BalanceState of pull_dom.load inside the function
try_find_move_task is actually the NeedsPull variant. This makes task
migration a bit more conservative when the system is under high load.

Experiments were performed while the system was compiling the Linux
kernel and undergoing a large amount of I/O at the same time using fio.

The results show that before the modification there were 126,617 task
migrations system-wide; after the modification there were 115,419.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:38:23 +08:00
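
A hedged sketch of the added check; BalanceState and the call site are
simplified stand-ins for the real scx_rusty types:

```rust
#[allow(dead_code)]
enum BalanceState {
    Balanced,
    NeedsPush,
    NeedsPull,
}

// Migrate a task only when the pull domain truly needs to pull load,
// which keeps migration conservative under heavy system load.
fn may_pull(pull_dom_load: &BalanceState) -> bool {
    matches!(pull_dom_load, BalanceState::NeedsPull)
}
```
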
David Vernet
5038f54701
Merge pull request #377 from jfernandez/metrics-rs
rusty: Integrate stats with the metrics framework
2024-06-21 15:23:20 -05:00
David Vernet
9919b71fd4
Merge pull request #379 from sched-ext/topo_nr_cpu_ids
Add topo.nr_cpu_ids() to Topology crate
2024-06-21 13:35:05 -05:00
David Vernet
772ac03311
Merge pull request #375 from sched-ext/htejun/revert-flatcg-refcnt
Revert "scx_flatcg: Keep cgroup rb nodes stashed"
2024-06-21 13:06:02 -05:00
David Vernet
3bd15be840
rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible()
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:20 -05:00
David Vernet
263e02f644
rusty: Use nr_cpu_ids instead of nr_cpus_possible
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:19 -05:00
David Vernet
bdbf4b9c05
topo: Return nr_cpu_ids from host Topology
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-CPU DSQ. To avoid confusing schedulers, let's track the
maximum CPU ID on a system so that we can return a number of CPU IDs
that is inclusive of gaps.

We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:13 -05:00
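
A sketch of the distinction, with a stand-in for the real Topology type:

```rust
struct Topology {
    max_cpu_id: usize,
    nr_possible: usize,
}

impl Topology {
    // Highest CPU ID plus one, inclusive of gaps: the safe size for
    // per-CPU-ID allocations such as per-CPU DSQs.
    fn nr_cpu_ids(&self) -> usize {
        self.max_cpu_id + 1
    }

    // Number of possible CPUs; this can be smaller than nr_cpu_ids()
    // when disabled CPUs sit in the middle of the ID range.
    fn nr_cpus_possible(&self) -> usize {
        self.nr_possible
    }
}
```
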
David Vernet
68116302d8
Merge pull request #378 from sirlucjan/hooks-update
Simplifying pacman-hooks
2024-06-21 12:30:12 -05:00
David Vernet
3219d15e3d
Merge pull request #292 from hodgesds/stress-ng-ci
Add stress-ng to scheduler tests
2024-06-21 11:35:56 -05:00
Jose Fernandez
83373b1f4e
rusty: Integrate stats with the metrics framework
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.

This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.

The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.

A new flag, `--enable-prometheus` (default: false), has been added. When
enabled, it starts a Prometheus endpoint on port 9000. This endpoint
allows metrics to be charted in Prometheus or Grafana dashboards.

Future changes will migrate additional stats to this framework and add
support for other backends.

[1] https://metrics.rs/

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-21 10:18:44 -06:00
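
A sketch of the monotonic counter described above, using the metrics-rs
facade (the helper function and label value are illustrative):

```rust
use metrics::counter;

// Increment the per-type dispatch counter; raw counts are emitted and
// percentages can be derived at query time.
fn on_task_dispatched(kind: &'static str) {
    counter!("dispatched_tasks_count", "type" => kind).increment(1);
}
```
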
Piotr Gorski
3684b1601c
Simplifying pacman-hooks
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-21 12:18:33 +02:00
Tejun Heo
7a40059b55 Revert "scx_flatcg: Keep cgroup rb nodes stashed"
This reverts commit 3b7f33ea1b.

I haven't root-caused it yet, but it's easy to reproduce a stall and
trigger the watchdog after the commit: just running stress in multiple
cgroups triggers stalls after a couple tens of seconds. Let's revert it
for now.
2024-06-19 14:44:26 -10:00
Andrea Righi
92ca7f385c
Merge pull request #374 from sched-ext/rustland-alloc-refactoring
scx_rustland_core: include buddy-alloc and refactor allocator code
2024-06-19 19:15:32 +02:00
Andrea Righi
b04e82b5eb scx_rustland_core: include buddy-alloc and refactor allocator code
The dependency on the buddy-alloc crate [1] seems to cause some trouble
with packaging, mostly because the crate's selftests fail when it's
compiled in release mode.

For example:

 $ cargo test --release -- --nocapture
 thread 'tests::fast_alloc::test_basic_malloc' panicked at src/tests/fast_alloc.rs:25:13:
 assertion `left == right` failed
   left: 0
  right: 42

Some of these failures with BuddyAlloc can be fixed by using a memory
arena buffer aligned to page size.

However, some test failures with FastAlloc persist that cannot be
resolved merely by aligning the pre-allocated memory arena to the page
size, as mentioned in [2].

The concern is that this may potentially lead to actual memory bugs.

Therefore, it seems safer to refactor the custom allocator code to
simply use BuddyAlloc, dropping FastAlloc completely.

To achieve this, the entire BuddyAlloc code has been directly included
in scx_rustland_core, referencing the original project and its MIT
licensing information (with the entire code still distributed under the
GPLv2 license).

Then the code was slightly modified to remove FastAlloc, and the
external dependency on the buddy-alloc crate was dropped.

From a performance perspective this change doesn't seem to introduce any
measurable regression.

[1] https://github.com/jjyr/buddy-alloc
[2] https://github.com/jjyr/buddy-alloc/issues/16

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-19 14:44:04 +02:00
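
A minimal sketch of the page-alignment fix mentioned above, assuming
4096-byte pages and an arbitrary arena size:

```rust
// Page-aligned, statically pre-allocated memory arena for the allocator.
#[repr(C, align(4096))]
struct Arena([u8; 64 * 1024]);

static mut ARENA: Arena = Arena([0; 64 * 1024]);
```
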
Changwoo Min
9c21ace276
Merge pull request #373 from vax-r/lavd_reuse
scx_lavd: Reuse can_task1_kick_task2
2024-06-19 15:29:05 +09:00
David Vernet
b1b43fdbd8
Merge pull request #372 from vax-r/util_entry
scx_utils: Utilize Entry API for BTreeMap insertion
2024-06-18 22:40:44 -05:00
I Hsin Cheng
99960ad960 scx_lavd: Reuse can_task1_kick_task2
Use the function can_task1_kick_task2() to replace places that also
check comp_preemption_info between two CPUs, for better consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-19 11:01:31 +08:00
I Hsin Cheng
1334a4df5d scx_utils: Utilize Entry API for BTreeMap insertion
Take advantage of BTreeMap's Entry API with or_insert() to perform
conditional insertion: insert only when the entry doesn't exist. Doing
so reduces the amount of code, provides better readability, and allows
in-place manipulation.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-19 10:27:10 +08:00
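
A small, self-contained example of the pattern (key and value types are
illustrative):

```rust
use std::collections::BTreeMap;

// Insert only when the entry doesn't exist, then mutate in place.
fn add_cpu(map: &mut BTreeMap<usize, Vec<usize>>, node: usize, cpu: usize) {
    map.entry(node).or_insert_with(Vec::new).push(cpu);
}
```
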