This is a second attempt to optimize tunables for a wider range of
games.
1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range).
Now the latency priority (biased by nice value) will decide which
task should run first. The nice value will decide the time slice.
2) The first change gives latency-critical tasks higher priority than
before. To compensate, the slice boost is also increased
(2x -> 3x).
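As a rough illustration of the percentages quoted above (a sketch, not the
scheduler's code; NICE_WIDTH and boost_range_pct are hypothetical names, and
the 40-level nice range is taken from the text):

```rust
/// Number of distinct nice levels (-20..=19), as implied by "40 (100%)".
const NICE_WIDTH: u32 = 40;

/// Share of the nice range covered by a given boost range, in percent.
/// Hypothetical helper for illustration only.
fn boost_range_pct(boost_range: u32) -> u32 {
    boost_range * 100 / NICE_WIDTH
}
```

With these assumptions, a boost range of 14 covers 35% of the nice range and
a boost range of 40 covers all of it.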
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.
scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.
Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.
Counters:
    dispatched_tasks_total: 65559 [1344.8/s]
        prev_idle: 44963 (68.6%) [966.5/s]
        wsync_prev_idle: 15696 (23.9%) [317.3/s]
        direct_dispatch: 2833 (4.3%) [35.3/s]
        dsq: 1804 (2.8%) [21.3/s]
        wsync: 262 (0.4%) [4.3/s]
        direct_greedy: 1 (0.0%) [0.0/s]
        pinned: 0 (0.0%) [0.0/s]
        greedy_idle: 0 (0.0%) [0.0/s]
        greedy_xnuma: 0 (0.0%) [0.0/s]
        direct_greedy_far: 0 (0.0%) [0.0/s]
        greedy_local: 0 (0.0%) [0.0/s]
    dl_clamped_total: 1290 [20.3/s]
    dl_preset_total: 514 [1.0/s]
    kick_greedy_total: 6 [0.3/s]
    lb_data_errors_total: 0 [0.0/s]
    load_balance_total: 0 [0.0/s]
    repatriate_total: 0 [0.0/s]
    task_errors_total: 0 [0.0/s]
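
The count, percentage, and rate on a labelled counter line could be derived
roughly as follows. This is a minimal sketch, not the recorder's actual code:
format_labelled and its parameters are hypothetical, and the rate is assumed
to be the change since the previous log interval divided by the interval
length.

```rust
/// Format one labelled counter line: count, share of total, and rate.
/// `prev` is the count seen at the previous log interval and
/// `interval_secs` the interval length (both hypothetical inputs).
fn format_labelled(name: &str, count: u64, prev: u64, total: u64, interval_secs: f64) -> String {
    // Guard against an empty parent counter.
    let pct = if total == 0 {
        0.0
    } else {
        count as f64 * 100.0 / total as f64
    };
    // Counters are monotonic, so the delta is never negative in practice;
    // saturating_sub keeps the sketch safe anyway.
    let rate = count.saturating_sub(prev) as f64 / interval_secs;
    format!("{name}: {count} ({pct:.1}%) [{rate:.1}/s]")
}
```

With a made-up 10-second interval and a previous count of 35298, this
reproduces the prev_idle line shown above.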
Gauges will show the last set value:
Gauges:
    slice_length_us: 20000.00
Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.
Histograms:
    cpu_busy_pct: avg=1.66 min=1.16 max=2.16
    load_avg node=0: avg=0.31 min=0.23 max=0.39
    load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
    processing_duration_us: avg=297.50 min=296.00 max=299.00
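
The summarize-then-reset behaviour described for histograms can be sketched
like this. The Histogram type below is hypothetical and only illustrates the
idea of an unbounded sample buffer that is cleared at every log interval; it
is not the data structure used by metrics-rs.

```rust
/// Hypothetical histogram that buffers raw samples between log intervals.
#[derive(Default)]
struct Histogram {
    samples: Vec<f64>,
}

impl Histogram {
    /// Record one sample; the buffer grows without bound until reset.
    fn record(&mut self, value: f64) {
        self.samples.push(value);
    }

    /// Return (avg, min, max) for the current interval and clear the
    /// buffer so memory use stays bounded. Returns None when empty.
    fn summarize_and_reset(&mut self) -> Option<(f64, f64, f64)> {
        if self.samples.is_empty() {
            return None;
        }
        let n = self.samples.len() as f64;
        let sum: f64 = self.samples.iter().sum();
        let min = self.samples.iter().copied().fold(f64::INFINITY, f64::min);
        let max = self.samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        self.samples.clear();
        Some((sum / n, min, max))
    }
}
```

Recording 296.0 and 299.0 and then summarizing yields avg=297.50, min=296.00,
max=299.00, matching the processing_duration_us line above; a second call
right after returns None because the samples were dropped.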
Signed-off-by: Jose Fernandez <josef@netflix.com>
In some games (e.g., Elden Ring), it was observed that preemption
happens much less frequently. The reason is that tasks' runtime per
schedule is similar, so it does not meet the existing criteria. To
alleviate the problem, the following three tunables are revised:
1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to
trigger more preemption.
2) Smaller LAVD_SLICE_MAX_NS works better, especially on 250 or 300Hz
kernels.
3) Longer LAVD_ELIGIBLE_TIME_MAX perturbs timelines less frequently.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The original assignment of the variable ridx is equivalent to taking the
larger of "ridx" and "wids - MAX_PIDS". Use the u64 max library helper
function to perform the comparison and improve readability.
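
The equivalence can be sketched as follows. This is an illustration under
assumptions, not the patched code: MAX_PIDS is given a made-up value, both
function names are hypothetical, and saturating_sub is used so the sketch
cannot underflow.

```rust
/// Hypothetical ring size, for illustration only.
const MAX_PIDS: u64 = 8;

/// Branch-based form: move ridx forward when it has fallen behind the floor.
fn clamp_ridx_branch(ridx: u64, wids: u64) -> u64 {
    let floor = wids.saturating_sub(MAX_PIDS);
    if ridx < floor {
        floor
    } else {
        ridx
    }
}

/// Same result expressed with the u64 max helper.
fn clamp_ridx_max(ridx: u64, wids: u64) -> u64 {
    ridx.max(wids.saturating_sub(MAX_PIDS))
}
```

Both forms return the same value for every input; the second states the
intent in one expression.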
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Check whether the BalanceState of pull_dom.load inside the function
try_find_move_task is actually the variant NeedsPull. This makes task
migration a bit more conservative when the system is under high load.
Experiments were performed while the system was compiling the Linux kernel
and undergoing a large amount of I/O at the same time using fio. The
results show 126,617 task migrations system-wide before the modification
and 115,419 after it.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().
Signed-off-by: David Vernet <void@manifault.com>
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.
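
To make the difference concrete, here is a minimal sketch (not the
TopologyMap implementation; the free function and the example CPU list are
hypothetical) of why counting possible CPUs is not enough when IDs have gaps:

```rust
/// Number of CPU IDs inclusive of gaps: highest possible ID plus one.
/// Returns 0 for an empty list.
fn nr_cpu_ids(possible_cpus: &[usize]) -> usize {
    possible_cpus.iter().max().map_or(0, |&max_id| max_id + 1)
}

/// Allocate one slot per CPU ID so that any possible ID is a valid index.
fn per_cpu_slots(possible_cpus: &[usize]) -> Vec<u64> {
    vec![0; nr_cpu_ids(possible_cpus)]
}
```

For a host whose possible CPUs are 0, 1, 4, and 5, there are four possible
CPUs but six CPU IDs. A per-CPU array sized by the former would be too small
to index with CPU 5, while one sized by the latter is not.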
We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.
Signed-off-by: David Vernet <void@manifault.com>
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.
This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.
The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.
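
The layer of indirection can be pictured with the std-only sketch below. It
is deliberately not the `metrics` crate API, which defines its own
interfaces; CounterBackend, InMemory, Stdout, and record_dispatch are all
hypothetical names used to show that the stats code depends only on an
interface and emits raw monotonic counts.

```rust
use std::collections::BTreeMap;

/// Hypothetical backend interface; stats code only sees this trait.
trait CounterBackend {
    fn increment(&mut self, name: &str, label: &str);
}

/// Backend that keeps raw monotonic counts in memory.
#[derive(Default)]
struct InMemory {
    counts: BTreeMap<(String, String), u64>,
}

impl CounterBackend for InMemory {
    fn increment(&mut self, name: &str, label: &str) {
        *self
            .counts
            .entry((name.to_string(), label.to_string()))
            .or_insert(0) += 1;
    }
}

/// Backend that only prints each increment.
struct Stdout;

impl CounterBackend for Stdout {
    fn increment(&mut self, name: &str, label: &str) {
        println!("{name}{{type=\"{label}\"}} +1");
    }
}

/// Stats code: unchanged no matter which backend is plugged in.
fn record_dispatch(backend: &mut dyn CounterBackend, dispatch_type: &str) {
    backend.increment("dispatched_tasks_count", dispatch_type);
}
```

Adding another backend in this sketch means adding one more impl of the
trait; record_dispatch stays untouched, and percentages can be derived later
from the raw counts.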
A new flag, `--enable-prometheus`, has been added (default is false). When
enabled, it starts a Prometheus endpoint on port 9000. This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.
Future changes will migrate additional stats to this framework and add
support for other backends.
[1] https://metrics.rs/
Signed-off-by: Jose Fernandez <josef@netflix.com>
This reverts commit 3b7f33ea1b.
I haven't root-caused it yet, but it's easy to reproduce stalls and trigger
the watchdog after the commit: just running stress in multiple cgroups
triggers stalls after a couple of tens of seconds. Let's revert it for
now.
The dependency on the buddy-alloc crate [1] seems to cause some trouble
with packaging, mostly because the selftests for the crate are failing
when it's compiled in release mode.
For example:
$ cargo test --release -- --nocapture
thread 'tests::fast_alloc::test_basic_malloc' panicked at src/tests/fast_alloc.rs:25:13:
assertion `left == right` failed
left: 0
right: 42
Some of these failures with BuddyAlloc can be fixed by using a memory
arena buffer aligned to page size.
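
One way to obtain a page-aligned arena in Rust is a wrapper type with an
explicit alignment. The sketch below is only an illustration: the 4096-byte
page size, the arena size, and the names AlignedArena and ARENA are
assumptions, not the code in scx_rustland_core.

```rust
/// Assumed page size; real systems may use a different value.
const PAGE_SIZE: usize = 4096;
/// Hypothetical arena size: four pages.
const ARENA_SIZE: usize = 4 * PAGE_SIZE;

/// Wrapper forcing page alignment of the backing buffer.
/// `repr(align)` needs a literal, so 4096 is repeated here.
#[repr(align(4096))]
struct AlignedArena([u8; ARENA_SIZE]);

/// Pre-allocated, zero-initialized arena.
static ARENA: AlignedArena = AlignedArena([0; ARENA_SIZE]);

/// True when the arena's start address is a multiple of the page size.
fn arena_is_page_aligned() -> bool {
    (ARENA.0.as_ptr() as usize) % PAGE_SIZE == 0
}
```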
However, some test failures with FastAlloc persist that cannot be
resolved merely by aligning the pre-allocated memory arena to the page
size, as mentioned in [2].
The concern is that this may potentially lead to actual memory bugs.
Therefore, it seems safer to refactor the custom allocator code to
simply use BuddyAlloc, dropping FastAlloc completely.
To achieve this, the entire BuddyAlloc code has been directly included
in scx_rustland_core, referencing the original project and its MIT
licensing information (with the entire code still distributed under the
GPLv2 license).
Then the code has been slightly modified to remove FastAlloc and the
external dependency on the buddy-alloc crate has been dropped.
From a performance perspective this change doesn't seem to introduce any
measurable regression.
[1] https://github.com/jjyr/buddy-alloc
[2] https://github.com/jjyr/buddy-alloc/issues/16
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use the function can_task1_kick_task2() in the places that also compare
the comp_preemption_info of two CPUs, for better consistency.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Take advantage of BTreeMap's Entry API together with or_insert() to do
the conditional insertion: insert only when the entry doesn't exist.
Doing so reduces the amount of code, improves readability, and
manipulates the entry in place.
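
A minimal before/after sketch of the pattern follows. The map's key and value
types and both function names are hypothetical; only BTreeMap::entry() and
Entry::or_insert() are the real standard library calls.

```rust
use std::collections::BTreeMap;

/// Manual form: look up, insert if missing, then look up again to mutate.
fn push_manual(map: &mut BTreeMap<u32, Vec<u32>>, key: u32, value: u32) {
    if !map.contains_key(&key) {
        map.insert(key, Vec::new());
    }
    map.get_mut(&key).unwrap().push(value);
}

/// Entry API form: or_insert() inserts only when the key is absent and
/// returns a mutable reference for in-place manipulation either way.
fn push_with_entry(map: &mut BTreeMap<u32, Vec<u32>>, key: u32, value: u32) {
    map.entry(key).or_insert(Vec::new()).push(value);
}
```

Both functions leave the map in the same state; the Entry form does a single
lookup and does not overwrite an existing value.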
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
It seems that we are not updating `is_idle` when we find an idle CPU
with pick_cpu(), causing unnecessary rescheduling events when
select_cpu() is called.
To resolve this, ensure that the is_idle state is correctly set.
Additionally, always ensure that the task is dispatched to the local DSQ
immediately upon finding (and reserving) an idle CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
- clean up u64 and u32 usages in structures to reduce struct size
- refactor pick_cpu() for readability
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, the required CPU performance (cpuperf) was set to 1024 (100%)
only when the CPU utilization was 100%. When a sudden load spike happened,
the system adapted slowly in the next interval.
The new scheme always reserves some headroom in advance, so it sets
cpuperf to 1024 when the CPU utilization reaches 85%. This gives some
room to adapt in advance.
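
The effect of the headroom can be sketched as below. Note the assumptions:
the linear mapping from utilization to cpuperf is a guess made for
illustration (the text only states the saturation points), and cpuperf_target
and its parameters are hypothetical names.

```rust
/// Maximum performance level (100%).
const CPUPERF_MAX: u64 = 1024;

/// Target cpuperf for a utilization given in percent, assuming linear
/// scaling that saturates once utilization reaches `full_at_pct`.
/// `full_at_pct` must be non-zero.
fn cpuperf_target(util_pct: u64, full_at_pct: u64) -> u64 {
    (util_pct * CPUPERF_MAX / full_at_pct).min(CPUPERF_MAX)
}
```

Under this assumption, 85% utilization maps to 870 with the old saturation
point of 100% but already to 1024 with the new saturation point of 85%, so a
spike on top of 85% utilization finds the CPU at full performance.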
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Reorder the execution sequence before the lookup operation for new_domc.
If new_dom_id == NO_DOM_FOUND, the lookup for new_domc is guaranteed to
fail, so there is no need to wait until new_domc turns out to be NULL:
clear the cpumask and return directly in that case.
Also, avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx to keep the interface consistent.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
__COMPAT_scx_bpf_consume_task() wasn't calling scx_bpf_consume_task() at all
and was always returning false. Fix it.
Also, update scx_qmap usage example so that it matches cgroup ID rather than
comm prefix. This should make testing with multiple processes a bit easier.
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.
Signed-off-by: David Vernet <void@manifault.com>
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:
- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
where to migrate
- Checking whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...
Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.
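
The decision described above can be sketched as a standalone function. This
is a simplified illustration, not the BPF code: wake_sync_target and the
llc_of table are hypothetical, and checking the LLC before the waker's
runqueue is an assumed ordering.

```rust
/// Pick a CPU for a sync wakeup, or None to fall back to the normal path.
/// `llc_of[cpu]` is the LLC id of `cpu` (hypothetical topology table).
fn wake_sync_target(
    prev_cpu: usize,
    waker_cpu: usize,
    llc_of: &[usize],
    waker_rq_empty: bool,
) -> Option<usize> {
    if llc_of[prev_cpu] == llc_of[waker_cpu] {
        // prev_cpu is already cache-local to the waker: stay put.
        return Some(prev_cpu);
    }
    if waker_rq_empty {
        // Different LLC and the waker has nothing queued: migrate there.
        return Some(waker_cpu);
    }
    None
}
```

With CPUs 0 and 1 on one LLC and CPUs 2 and 3 on another, a wakee whose
prev_cpu is 1 stays there when woken from CPU 0, while a wakee on CPU 2 moves
to CPU 0 only if CPU 0's runqueue is empty.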
Signed-off-by: David Vernet <void@manifault.com>
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened when @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.
To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doesn't have to worry about branch pruning, etc.,
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.
Signed-off-by: David Vernet <void@manifault.com>
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.dump*(). The
open helper macros now check the existence of the fields and abort if
missing.
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.tick(). The
open helper macros now check the existence of the field and abort if
missing.
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.exit_dump_len.
The open helper macros now check the existence of the field and abort if
missing.